ABSTRACT



A method and system for computing the 2-D DCT/IDCT is described.
It is easy to implement with VLSI technology and achieves the
high throughput required for real-time high definition video
processing. A direct 2-D matrix factorization approach is
utilized to compute the 2-D DCT/IDCT. The 8x8 DCT/IDCT is
computed through four 4x4 matrix multiplication sub-blocks.
Each sub-block is half the size of the original 8x8 size and
therefore requires a much lower number of multiplications.
Additionally, each sub-block can be implemented independently
with localized interconnection so that parallelism can be
exploited and a much higher DCT/IDCT throughput can be achieved.

27 Claims, 29 Drawing Sheets 
METHOD AND SYSTEM FOR COMPUTING 8x8 DCT/IDCT AND A VLSI
IMPLEMENTATION

This application claims the benefit of provisional application
Ser. No. 60/073,367 filed Feb. 2, 1998.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to calculating the
2-Dimensional 8x8 Discrete Cosine Transform (2-D DCT) and the
Inverse Discrete Cosine Transform (2-D IDCT), and its very large
scale integrated (VLSI) implementation. Specifically, the present
invention is well suited to meet the real time digital processing
requirements of digital High-Definition Television (HDTV).

2. Related Art

OUTLINE OF RELATED ART SECTION

1.0 Overview of Video Coding and MPEG Implementations
1.1 Video Compression
1.2 MPEG Video Compression: A Quick Look
1.2.1 MPEG Video Sequences, Groups and Pictures
1.2.2 MPEG Video Slice, Macroblock and Block
1.2.3 The Motion Estimation/Compensation in MPEG
1.2.4 The Discrete Cosine Transform in MPEG
1.2.5 The Quantization in MPEG
1.2.6 The Zigzag Scan and Variable Length Coding in MPEG
1.2.7 MPEG Video Encoding Process
1.2.8 MPEG Video Decoding Process
1.3 MPEG-1 Video Standard
1.4 MPEG-2 Video Standard
1.4.1 Fields, Frames and Pictures
1.4.2 Chrominance Sampling
1.4.3 Scalability
1.4.4 Profiles and Levels
1.5 Hybrid Implementation Scheme for MPEG-2 Video System
2.0 DCT/IDCT Algorithms and Hardware Implementations
2.1 Introduction
2.2 1-D DCT/IDCT Algorithms and Implementations
2.2.1 Indirect 1-D DCT via Other Discrete Transforms
2.2.2 1-D DCT via Direct Factorizations
2.2.3 1-D DCT Based on Recursive Algorithms
2.2.4 1-D DCT/IDCT Hardware Implementations
2.3 2-D DCT/IDCT Algorithms and Implementations
2.3.1 2-D DCT via Other Discrete Transforms
2.3.2 2-D DCT by Row-Column Method (RCM)
2.3.3 2-D DCT Based on Direct Matrix Factorization/Decomposition
2.3.4 2-D DCT/IDCT Hardware Implementations
2.4 Summary

1.0 OVERVIEW OF VIDEO CODING AND
MPEG IMPLEMENTATIONS

In this section, a brief overview of video compression, Moving
Pictures Experts Group (MPEG) video protocols and different
implementation approaches is presented. A list of references
cited in this application is included in an Appendix. Each of the
references listed is incorporated herein by reference in its
entirety.

1.1 Video Compression

The reduction of transmission and storage requirements for
digitized video signals has been a research and development topic
all over the world for more than 30 years. Many efforts have been
made to deliver or store digital television signals, which have a
bit-rate of more than 200 Mbit/s in an uncompressed format and
must be brought down to a level that can be handled economically
by current video processing technology. For example, suppose the
pictures in a sequence are digitized as discrete grids or arrays
with 360 pels (picture elements) per raster line, 288
lines/picture (a typical resolution for MPEG-1 video
compression), three-color separation and sampled with 8-bit
precision for each color. The uncompressed video sequence at 24
pictures/second is then roughly 60 Mbit/s, and a one-minute video
clip requires 448 Mbytes of storage space.

The International Standardization Organization (ISO) started its
moving picture standardization process in 1988 with a strong
emphasis on real-time decoding of compressed data stored on
digital storage devices. A Moving Pictures Experts Group (MPEG)
was formed in May 1988 and a consensus was reached to target the
digital storage and real-time decoding of video with bit-rates
around 1.5 Mbit/s (MPEG-1 protocol) [MPEG1]. At the MPEG meeting
held in Berlin, Germany in December 1990, an MPEG-2 proposal was
presented that primarily targeted higher bit-rates, larger
picture sizes, and interlaced frames. The MPEG-2 proposal
attempted to address a much broader set of applications than
MPEG-1 (such as television broadcasting, digital storage media,
digital high-definition TV (HDTV) and video communication) while
maintaining all of the MPEG-1 video syntax. Moreover, extensions
were adopted to add flexibility and functionality to the
standard. Most importantly, a spatial scalable extension was
added to allow video data streams with multiple resolutions to
provide support for both normal TV and HDTV. Other scalable
extensions allow the data stream to be partitioned into different
layers in order to optimize transmission and reception over
existing and future networks [MPEG2].

An overview of MPEG video compression techniques, MPEG-1's video
layers and MPEG-2's video layers is presented in sections 1.2,
1.3 and 1.4, respectively. A proposed hybrid implementation
scheme for MPEG-2 video codec is shown in section 1.5. An outline
of rest of the thesis is presented in section 1.6.

1.2 MPEG Video Compression: A Quick Look

An MPEG codec is specifically designed for compression of video
sequences. Because a video sequence is simply a series of
pictures taken at closely spaced time intervals, these pictures
tend to be quite similar to one another except when a scene
change takes place. The MPEG1 and MPEG2 codecs are designed to
take advantage of this similarity using both past and future
temporal information (inter-frame coding). They also utilize
commonality within each frame to reduce the bit-rate (intra-frame
coding) [MPEG1, MPEG2].

1.2.1 MPEG Video Sequences, Groups and Pictures

An MPEG video sequence is made up of individual pictures
occurring at fixed time increments. Except for certain critical
timing information in the MPEG systems layers, an MPEG video
sequence bitstream is completely self-contained and is
independent of other video bitstreams.

Each video sequence is divided into one or more groups of
pictures, and each group of pictures is composed of one or more
pictures of three different types: I-, P- and B-type. I-pictures
(intra-coded pictures) are coded independently, entirely without
reference to other pictures. P- and B-pictures are compressed by
coding the differences between the reference picture and the
current one, thereby
exploiting the similarities between the current and reference
pictures to achieve a high compression ratio. One example of a
typical MPEG I-, P- and B-picture arrangement in display order
is illustrated in FIG. 1.

The first coded picture in each video sequence must be an
I-picture. I-pictures may be occasionally inserted in different
positions of a video sequence to prevent the propagation of
coding errors. For I-pictures, the coding method used by MPEG is
similar to that defined by JPEG [JPEG].

P-pictures (predictive-coded pictures) obtain predictions from
temporally preceding I- or P-pictures in the sequence and
B-pictures (bidirectionally predictive-coded pictures) obtain
predictions from the nearest preceding and/or upcoming I- or
P-pictures in the sequence. B-pictures may predict from preceding
pictures, upcoming pictures, both, or neither. Similarly,
P-pictures may predict from a preceding picture or use
intra-coding.

A given sequence of pictures is encoded in an order different
from the order in which the pictures are displayed when viewing
the sequence. An example of the encoding sequence of MPEG I-, P-
and B-pictures is illustrated in FIG. 2.

Each component of a picture is made up of a two-dimensional
(2-D) array of samples. Each horizontal line of samples in this
2-D grid is called a raster line, and each sample in a raster
line is a digital representation of the intensity of the
component at that point on the raster line. For color sequences,
each picture has three components: a luminance component and two
chrominance components. The luminance provides the intensity of
the sample point, whereas the two chrominance components express
the equivalent of color hue and saturation at the sample point.
They are mathematically equivalent to an RGB primaries
representation but are better suited for efficient compression.
RGB can be used if less efficient compression is acceptable.

The equivalent counterpart of a picture in broadcast video (for
example analog NTSC) is a frame, which is further divided into
two fields. Each field has half the raster lines of the full
frame and the fields are interleaved such that alternate raster
lines in the frame belong to alternate fields.

1.2.2 MPEG Video Slice, Macroblock and Block

The basic building block of an MPEG picture is the macroblock.
The macroblock consists of one 16x16 array of luminance samples
plus one, two or four 8x8 blocks of samples for each of the two
chrominance components. The 16x16 luminance array is actually
composed of four 8x8 blocks of samples. The 8x8 block is the unit
structure of the MPEG video codec and is the quantity that is
processed as an entity in the codec.

Each MPEG picture is composed of slices, where each slice is a
contiguous sequence of macroblocks in raster scan order. The
slice starts at a specific address or position in the picture
specified in the slice header. Slices can continue from one
macroblock row to the next in MPEG-1, but not in MPEG-2.

1.2.3 The Motion Estimation/Compensation in MPEG

If there is motion in the sequence, a better prediction is often
obtained by coding differences relative to reference areas that
are shifted with respect to the area being coded, a process known
as motion compensation. The process of determining the motion
vectors in the encoder is called motion estimation, and the unit
area being predicted is a macroblock.

The motion vectors describing the direction and amount of motion
of the macroblocks are transmitted to the decoder as part of the
bitstream. The decoder then knows which area of the reference
picture was used for each prediction, and sums the decoded
difference with this motion compensated prediction to get the
final result. The encoder must follow the same procedure when the
reconstructed picture will be used for predicting other pictures.
The vectors are the same for each pel in the same macroblock, and
the vector precision is either full-pel or half-pel accuracy.

1.2.4 The Discrete Cosine Transform in MPEG

The discrete cosine transform (DCT) is the critical part of both
intra and inter coding for MPEG compression. The DCT has certain
properties that simplify coding models and make the coding more
efficient in terms of perceptual quality measures.

Basically, the DCT is a method of decomposing the correlation of
a block of data into the spatial frequency domain. The amplitude
of each value in the spatial frequency (coefficient) domain
represents the contribution of that spatial frequency pattern in
the block of data being analyzed. If only the low-frequency DCT
coefficients are nonzero, the data in the block vary slowly with
position. If high frequencies are present, the block intensity
changes rapidly from pel to pel.

1.2.5 The Quantization in MPEG

When the DCT is computed for a block of pels, it is desirable to
represent the high spatial frequency coefficients with less
precision and the low spatial frequency ones with more precision.
This is done by a process called quantization. A DCT coefficient
is quantized by dividing it by a nonzero positive integer called
a quantization value and rounding it to the nearest integer. The
bigger the quantization value is, the lower the precision is of
the quantized DCT coefficient. Lower-precision coefficients can
be transmitted or stored with fewer bits. Generally speaking, the
human eye is more sensitive to lower spatial frequency effects
than higher ones, which is why the lower frequencies are
quantized with higher precision.

As noted above, a macroblock may be composed of four 8x8 blocks
of luminance samples and two 8x8 blocks of chrominance samples. A
lower resolution is used here for the chrominance blocks because
the human eye can resolve higher spatial frequencies in luminance
than in chrominance.

In intra coding, the DCT coefficients are almost completely
decorrelated, that is, they are independent of one another and
therefore can be coded independently. Decorrelation is of great
theoretical and practical interest in terms of construction of
the coding model. The coding performance is also influenced
profoundly by the visually-weighted quantization.

In non-intra coding, the DCT does not greatly improve the
decorrelation, since the difference signal obtained by
subtracting the prediction from the reference pictures is already
fairly well decorrelated. However, quantization is still a
powerful compression technique for controlling the bit-rate, even
if decorrelation is not improved very much by the DCT. Since the
DCT coefficient properties are quite different for intra and
inter pictures, different quantization tables are used for intra
and inter coding.

1.2.6 The Zigzag Scan and Variable Length Coding in MPEG

The quantized 2-D DCT coefficients are arranged according to a
1-D sequence known as the zigzag scanning order. In most cases,
the scan orders the coefficients in ascending spatial
frequencies, which is illustrated in FIG. 3. By using a
quantization table which strongly deemphasizes higher spatial
frequencies, only a few low-frequency coefficients are nonzero in
a typical block, which results in very high compression.

After the quantization, the 1-D sequence is coded losslessly so
that the decoder can reconstruct exactly the same
results. For MPEG, an approximately optimal coding technique,
based on Huffman coding, is used to generate the tables of
variable length codes needed for this task. Variable length codes
are needed to achieve good coding efficiency, as very short codes
must be used for highly probable events. The run-length coding
and some specially defined symbols (such as end-of-block, EOB)
permit efficient coding of DCTs with mostly zero coefficients.

1.2.7 MPEG Video Encoding Process

The MPEG video encoding is a process that reads a stream of input
picture samples and produces a valid coded bitstream as defined
in the specification. The high-level coding system diagram shown
in FIG. 4 illustrates the structure of a typical encoder system
400. The MPEG video encoder divides the pictures in a sequence
into three basic categories: I-, P- and B-pictures as described
previously.

Since I-pictures are coded without reference to neighboring
pictures in the sequence, the encoder only exploits the
correlation within the picture. The incoming picture 405 will go
directly through switch 410 into 2-D DCT module 420, which
decomposes the data in each block into its underlying spatial
frequencies. Since the response of the human visual system is
much more sensitive to low spatial frequencies than high ones,
the frequencies are quantized with a quantization table with 64
entries in quantization module 430, in which each entry is a
function of spatial frequency for each DCT coefficient. In zigzag
scan module 470, the quantized coefficients are then arranged
qualitatively from low to high spatial frequency following the
same or a similar zigzag scan order to that shown in FIG. 3. The
rearranged 1-D sequence data is further processed with an entropy
coding (Huffman coding) scheme to achieve further compression.
Simultaneously, the quantized coefficients are also used to
reconstruct the decoded blocks using inverse quantization (module
440) and an inverse 2-D DCT (reconstruction module 450). The
reconstructed blocks stored in frame store memory 455 are used as
references for future differential coding for P- and B-pictures.

In contrast, P- and B-pictures are coded as the differences
between the current macroblocks and the ones in preceding and/or
upcoming reference pictures. If the image does not change much
from one picture to the next, the difference will be
insignificant and can be coded very effectively. If there is
motion in the sequence, a better prediction can be obtained from
pels in the reference picture that are shifted relative to the
current picture pels (see motion estimation module 460). The
differential results will be further compressed by the 2-D DCT,
quantization, zigzag and variable length coding modules (420,
430, 470, 480), similar to the I-picture case. Although the
decorrelation is not improved much by the DCT for the motion
compensated case, the quantization is still an effective way to
improve the compression rate. So MPEG's compression gain arises
from three fundamental principles: prediction, decorrelation, and
quantization.

1.2.8 MPEG Video Decoding Process

The MPEG video decoding process, which is the exact inverse of
the encoding process, is shown in FIG. 5. The decoder 500 accepts
the compressed video bitstream 485 generated from MPEG video
encoder 400 and produces output pictures 565 according to MPEG
video syntax.

The variable length decoding and inverse zigzag scan modules
(510, 520) reverse the results of the zigzag and variable length
coding to reconstruct the quantized DCT coefficients. The inverse
quantization and inverse 2-D DCT modules (530, 540) are exactly
the same modules as those in the encoder. The motion compensation
in motion compensation module 550 will only be carried out for
nonintra macroblocks in P- and B-pictures.

1.3 MPEG-1 Video Standard

The MPEG-1 video standard is primarily intended for digital
storage applications such as compact disk (CD), DAT, and magnetic
hard disks. It supports a continuous transfer rate up to 1.5
Mbit/s, and is targeted for non-interlaced video formats having
approximately 288 lines of 352 pels and picture rates around 24
Hz to 30 Hz. The coded representation of MPEG-1 video supports
normal speed forward playback, as well as special functions such
as random access, fast play, fast reverse play, normal speed
reverse playback, pause, and still pictures. The standard is
compatible with standard 525 and 625-line television formats, and
it provides flexibility for use with personal computer and
workstation displays [MPEG1].

Each picture of MPEG-1 consists of three rectangular matrices of
numbers: a luminance matrix (Y) and two chrominance matrices (Cb
and Cr). The Y-matrix must have an even number of rows and
columns and the Cb and Cr matrices are one half the size of the
Y-matrix in both horizontal and vertical dimensions.

The MPEG-1 video standard uses all the MPEG video compression
concepts and techniques listed in section 1.2. The MPEG-1 video
standard only defines the video bitstream, syntax and decoding
specifications for the coded video bitstream, and leaves a number
of issues undefined in the encoding process.

1.4 MPEG-2 Video Standard

The MPEG-2 video standard evolved from the MPEG-1 video standard
and is aimed at more diverse applications such as television
broadcasting, digital storage media, digital high-definition
television (HDTV), and communication [MPEG2].

Additional requirements are added into the MPEG-2 video standard.
It has to work across asynchronous transfer mode (ATM) networks
and therefore needs improved error resilience and delay
tolerance. It has to handle more programs simultaneously without
requiring a common time base. It also has to be backwards
compatible with MPEG-1. Furthermore, it is also targeted to code
interlaced video signals, such as those used by the television
industry. Much higher data transfer rates can be achieved by the
MPEG-2 system.

As a continuation of the original MPEG-1 standard, MPEG-2 borrows
a significant portion of its technology and terminology from
MPEG-1. Both MPEG-2 and MPEG-1 use the same layer structure
concepts (i.e. sequence, group, picture, slice, macroblock,
block, etc.). Both of them only specify the coded bitstream
syntax and decoding operation. Both of them invoke motion
compensation to remove the temporal redundancies and use the DCT
coding to compress the spatial information. Also, the basic
definitions of I-, P- and B-pictures remain the same in both
standards. However, the fixed eight bits of precision for the
quantized DC coefficients defined in MPEG-1 is extended to three
choices in MPEG-2: eight, nine and ten bits.

1.4.1 Fields, Frames and Pictures

At the higher bit-rates and picture rates that MPEG-2 video
targets, fields and interlaced video become important. The MPEG-2
video types are expanded from MPEG-1's I-, P- and B-pictures to
I-field picture, I-frame picture, P-field picture, P-frame
picture, B-field picture, and B-frame picture.

In an interlaced analog frame composed of two fields, the top
field occurs earlier in time than the bottom field. In MPEG-2,
coded frames may be composed of any adjacent pairs of fields. A
coded I-frame may consist of an I-frame picture, a pair of
I-field pictures, or an I-field picture followed by a P-field
picture. A coded P-frame may consist of a P-frame picture or a
pair of P-field pictures. A coded B-frame may consist of a
B-frame picture or a pair of B-field pictures. In contrast to
MPEG-1, which allows only progressive pictures, MPEG-2 allows
both interlaced and progressive pictures.

1.4.2 Chrominance Sampling

Compared with MPEG-1's single chrominance sampling format, MPEG-2
defines three chrominance sampling formats. These are labeled
4:2:0, 4:2:2 and 4:4:4.

For 4:2:0 format, the chrominance is subsampled 2:1 horizontally
and vertically as in MPEG-1. For 4:2:2 format, the chrominance is
subsampled 2:1 horizontally but not vertically. For 4:4:4 format,
the chrominance has the same sampling for all three components
and the decomposition into interlaced fields is the same for all
three components.
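
The effect of the three chrominance formats on picture size can be sketched in a few lines of Python. This is an illustration only: the function names and the dictionary are hypothetical, and the 720x576 picture size is simply an example taken from the main-level bounds.

```python
# Horizontal and vertical chrominance subsampling factors for the
# three MPEG-2 formats described above (illustrative helper names).
CHROMA_SUBSAMPLING = {
    "4:2:0": (2, 2),  # 2:1 horizontally and vertically
    "4:2:2": (2, 1),  # 2:1 horizontally only
    "4:4:4": (1, 1),  # same sampling as luminance
}


def chroma_plane_size(luma_width, luma_height, chroma_format):
    """Return (width, height) of one chrominance component."""
    h_factor, v_factor = CHROMA_SUBSAMPLING[chroma_format]
    return luma_width // h_factor, luma_height // v_factor


def samples_per_picture(luma_width, luma_height, chroma_format):
    """Total samples in one picture: one luminance plus two chrominance planes."""
    c_width, c_height = chroma_plane_size(luma_width, luma_height, chroma_format)
    return luma_width * luma_height + 2 * c_width * c_height


if __name__ == "__main__":
    for fmt in CHROMA_SUBSAMPLING:
        print(fmt, chroma_plane_size(720, 576, fmt), samples_per_picture(720, 576, fmt))
```

For a given luminance size, 4:2:0 carries 1.5 times the luminance sample count per picture, 4:2:2 carries 2 times, and 4:4:4 carries 3 times.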

1.4.3 Scalability

In order to cope with services like asynchronous transfer
mode (ATM) networks and HDTV with conventional TV
backward compatibility, more than one level of resolution
and display quality is needed in the MPEG-2 video standard.
MPEG-2 has several types of scalability enhancements
that allow low-resolution or smaller images to be decoded
from only part of the bitstream. MPEG-2 coded images can
be assembled into several layers. The standalone base layer
may use the nonscalable MPEG-1 syntax. One or two
enhancement layers are then used to get to the higher
resolution or quality. This generally requires fewer bits than
independently compressed images at each resolution and
quality, and at the same time achieves higher error resilience
for network transmission.

There are four different scalability schemes in the
MPEG-2 standard:

SNR scalability uses the same luminance resolution in the
lower layer and a single enhancement layer. The
enhancement layer contains mainly coded DCT coefficients
and a small overhead. In high-error transmission
environments, the base layer can be protected with
good error correcting techniques, while the enhancement
layer is allowed to be less resilient to errors.
Spatial scalability defines a base layer with a lower
resolution and adds an enhancement layer to provide
the additional resolution. In the enhancement layer, the
difference between an interpolated version of the base
layer and the source image is coded in order to accommodate
two applications with different resolution
requirements like conventional TV and HDTV.
Temporal scalability provides an extension to higher
temporal picture rates while maintaining backward
compatibility with lower-rate services. The lower temporal
rate is coded by itself as the basic temporal rate.
Then, additional pictures are coded using temporal
prediction relative to the base layer. Some systems may
decode both layers and multiplex the output to achieve
the higher temporal rate.
Data partitioning splits the video bitstream into two channels:
the first one contains all of the key headers,
motion vectors, and low-frequency DCT coefficients.
The second one carries less critical information such as
high frequency DCT coefficients, possibly with less
error protection.

1.4.4 Profiles and Levels

Profiles and levels provide a means of defining subsets of
the syntax and semantics of the MPEG-2 video specification and
thereby give the decoder the information required to decode
a particular bitstream. A profile is a defined sub-set of the
entire MPEG-2 bitstream. A level is a defined set of constraints
imposed on parameters in the bitstream.

MPEG-2 defines five distinct profiles: simple profile (SP),
main profile (MP), SNR scalable profile (SNR), spatial
scalable profile (SPT) and high profile (HP). Four levels are
also defined in MPEG-2: low (LL), main (ML), high-1440
(H-14) and high (HL) to put constraints on some of the
parameters in each profile because the parameter ranges are
too large to insist on compliance over the full ranges even
with the four profile subsets defined in the MPEG-2 video
syntax. Only some of the combinations among the profiles and
levels are valid. The permissible level combinations with the
main profile and their parameter values are listed in Table
1.1.



TABLE 1.1

Level definitions for main profile

Level        Parameters       Bound
-----------  ---------------  ------------
High         samples/line     1920
(MP@HL)      lines/frame      1152
             frames/sec       60
             luminance rate   62,668,800
             bit-rate         80 Mbit/s
High-1440    samples/line     1440
(MP@H-14)    lines/frame      1152
             frames/sec       60
             luminance rate   47,001,600
             bit-rate         60 Mbit/s
Main         samples/line     720
(MP@ML)      lines/frame      576
             frames/sec       30
             luminance rate   10,368,000
             bit-rate         15 Mbit/s
Low          samples/line     352
(MP@LL)      lines/frame      288
             frames/sec       30
             luminance rate   3,041,280
             bit-rate         4 Mbit/s
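
As an illustration of how level bounds constrain a bitstream, the following Python sketch copies the main-profile bounds from Table 1.1 and returns the lowest level whose bounds a given stream satisfies. The function name is hypothetical, and the sketch assumes that "luminance rate" means luminance samples per second of the actual stream (samples/line x lines/frame x frames/sec); that interpretation is an assumption, not something the table states.

```python
# Main-profile bounds copied from Table 1.1, lowest level first:
# (level, samples/line, lines/frame, frames/sec, luminance rate, bit-rate in Mbit/s)
MAIN_PROFILE_BOUNDS = [
    ("MP@LL", 352, 288, 30, 3_041_280, 4),
    ("MP@ML", 720, 576, 30, 10_368_000, 15),
    ("MP@H-14", 1440, 1152, 60, 47_001_600, 60),
    ("MP@HL", 1920, 1152, 60, 62_668_800, 80),
]


def lowest_main_profile_level(samples_per_line, lines_per_frame,
                              frames_per_sec, bit_rate_mbit):
    """Return the lowest main-profile level that admits the stream, or None."""
    # Assumption: luminance rate = luminance samples per second of the stream.
    luminance_rate = samples_per_line * lines_per_frame * frames_per_sec
    stream = (samples_per_line, lines_per_frame, frames_per_sec,
              luminance_rate, bit_rate_mbit)
    for level, *bounds in MAIN_PROFILE_BOUNDS:
        if all(value <= bound for value, bound in zip(stream, bounds)):
            return level
    return None


if __name__ == "__main__":
    print(lowest_main_profile_level(720, 480, 30, 6))
    print(lowest_main_profile_level(1920, 1080, 30, 20))
    print(lowest_main_profile_level(1920, 1152, 60, 80))
```

Under this assumption, a stream that sits at the maximum of every individual bound of the high level (1920 x 1152 at 60 frames/sec) still fails, because its luminance sample rate exceeds the separate luminance rate bound.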






The permissible level/layer combinations with the high profile
and their parameter values are listed in Table 1.2.

TABLE 1.2

Level definitions for high profile

Level        Parameters       Enh. layer bound   Base layer bound
-----------  ---------------  -----------------  ----------------
High         samples/line     1920               960
(HP@HL)      lines/frame      1152               576
             frames/sec       60                 30
             luminance rate   83,558,400         19,660,800
             bit-rate         80 Mbit/s          25 Mbit/s
High-1440    samples/line     1440               720
(HP@H-14)    lines/frame      1152               576
             frames/sec       60                 30
             luminance rate   62,668,800         14,745,600
             bit-rate         60 Mbit/s          20 Mbit/s
Main         samples/line     720                352
(HP@ML)      lines/frame      576                288
             frames/sec       30                 30
             luminance rate   14,745,600         3,041,280
             bit-rate         15 Mbit/s          4 Mbit/s



1.5 Hybrid Implementation Scheme for MPEG-2
Video System

From FIGS. 4 and 5 in section 1.2 one can see that the
encoding and decoding systems for MPEG video consist of
several function modules. The modules can be classified by
their computational requirements:

(1) The computations carried out are parallel in
nature and are best suited for implementation on a parallel
structured hardware component. These modules include the
2-D DCT, 2-D IDCT in both encoding/decoding processes,
motion estimation, and motion compensation modules.

(2) The computations carried out are serial in nature and
can only be carried out with a serial structure. These
modules include zigzag scan, inverse scan, variable length
coding and variable length decoding modules.

(3) The computations carried out are parallel in nature, but
they can be easily carried out with a serial structure without
suffering much performance penalty. These modules include
quantization and inverse quantization modules.

So far there have been many different approaches to
implementing an MPEG video encoding/decoding system.
Table 1.3 gives a brief summary of some MPEG-1 and
MPEG-2 video codec system implementations from some
major video vendors [Joa96].

TABLE 1.3

MPEG vendors and products

Vendor       Profile            Encoder included   Product
-----------  -----------------  -----------------  ----------
Array        MPEG-1             y                  H, S
C-Cube       MPEG-1, 2          y                  H, C, B, S
CompCore     MPEG-1, ML +       a                  D, S
Digital      MPEG-1             y                  S
Future Tel   MPEG-1             y                  B
Gt           SP, MP, LL, ML     y                  C, B, E
HMS          MPEG-1             a                  S
Hughes       MPEG-1             D                  B
IBM          MPEG-1, MP         y                  C, B
Imedia       MP@ML              y                  H, S
LSI          MPEG-1, MP@ML      n                  C, B
Siemens      MP, SNR            y                  E
Sun          MPEG-1             y                  S, B
TI           MPEG-1             n                  S, B

Product codes are: H = hardware, S = software, B = boards, C = chips,
E = products

One can see from Table 1.3 that all the MPEG-2 video
codec implementations so far have been limited to main
level MP@ML. For the MPEG-2 encoding process, the
biggest obstacles for real-time encoding are motion estimation
and 2-D DCT/IDCT. For the decoding process the 2-D
IDCT is the most computation intensive task that every
real-time decoding scheme needs to overcome. The huge
amount of computation required by motion estimation and
2-D DCT/IDCT prevents the current hardware and software
implementations of MPEG-2 video from moving from MP@ML
to higher levels. Table 1.4 shows just how computationally
intensive the 2-D IDCT is for the MPEG-2 video decoding
process.

TABLE 1.4

Upper bounds for total sample rate and 8x8 IDCT rate

Profile/level   High          High-1440     Main         Low
--------------  ------------  ------------  -----------  -----------
Simple                                      SP@ML
 sample rate                                31,104,000
 8x8 IDCT/s                                 486,000
Main            MP@HL         MP@H-14       MP@ML        MP@LL
 sample rate    188,006,400   141,004,800   31,104,000   12,165,120
 8x8 IDCT/s     2,937,600     2,223,200     486,000      190,080
SNR                                         SNR@ML       SNR@LL
 sample rate                                31,104,000   12,165,120
 8x8 IDCT/s                                 456,000      190,080
Spatial                       Spt@H-14
 sample rate                  86,054,400
 8x8 IDCT/s                   1,344,600
High            HP@HL         HP@H-14       HP@ML
 sample rate    154,828,800   116,121,600   26,680,320
 8x8 IDCT/s     2,419,200     1,814,400     416,880


From Table 1.4, it is clear that the number of 2-D IDCTs
in the decoding process will increase from only 486,000 8x8
blocks per second for MP@ML to 2,937,600 blocks per
second for MP@HL. Considering that most 8x8 2-D IDCT
chips developed so far can carry out about 1,500,000 block
transforms per second, and that the most powerful video
digital signal processor (DSP) chip (TMS320C80 by TI) can
only carry out 800,000 8x8 2-D IDCTs per second [May95],
providing real-time MPEG-2 High-level hardware remains a
challenge.
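
The IDCT rates in Table 1.4 are consistent with one 8x8 IDCT per 64 samples. The following Python sketch checks that arithmetic for a few entries; the helper name and the dictionary are illustrative only and are not part of the source.

```python
# Assumption: every 8x8 block (64 samples) needs exactly one 2-D IDCT.
BLOCK_SAMPLES = 8 * 8


def idct_blocks_per_second(sample_rate):
    """Return the 8x8 IDCT rate implied by a total sample rate."""
    blocks, remainder = divmod(sample_rate, BLOCK_SAMPLES)
    if remainder:
        raise ValueError("sample rate is not a whole number of 8x8 blocks")
    return blocks


# Upper-bound sample rates copied from Table 1.4.
SAMPLE_RATES = {
    "MP@HL": 188_006_400,
    "MP@ML": 31_104_000,
    "MP@LL": 12_165_120,
    "HP@HL": 154_828_800,
}

for level, rate in SAMPLE_RATES.items():
    print(level, idct_blocks_per_second(rate))
```

Under this assumption the MP@HL requirement is roughly six times the MP@ML requirement, which is the gap discussed above.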

In Section 2, existing 1-D DCT/IDCT and 2-D DCT/IDCT
algorithms, as well as the hardware implementations of
these algorithms, are reviewed. It is shown that all the
existing 2-D DCT/IDCT chip implementations have made
use of the separability property of the 2-D DCT/IDCT, since
very simple communication interconnection can be achieved
by this approach. The algorithms that require fewer
multiplications through direct matrix factorization/decomposition
are not necessarily suitable for hardware implementation.
Instead, the regularity of design and feasibility of layout
implied by the row-column method seem to be the main
concern for chip implementation.

2.0 DCT/IDCT ALGORITHMS AND 
HARDWARE IMPLEMENTATIONS 

In this section, some of the most commonly used
one-dimensional and two-dimensional Discrete Cosine Transform
(DCT) and Inverse Discrete Cosine Transform (IDCT)
algorithms are evaluated. Detailed implementation schemes
of some algorithms are also presented.

2.1 Introduction 

The development of fast algorithms for the Discrete
Fourier Transform (DFT) by Cooley and Tukey [CT65] in
1965 has led to phenomenal growth in its applications in
digital signal processing. Similarly, the discovery of the
Discrete Cosine Transform (DCT) in 1974 [ANR74] and its
potential applications have had a significant impact on
audio and video signal processing. Since 1974, the DCT/
IDCT have been widely used in image and speech data
analysis, recognition and compression. They have become
an integral part of several standards such as JPEG, MPEG,
CCITT Recommendation H.261 and other video conference
protocols.

A lot of fast algorithms and hardware architectures have
been introduced for one-dimensional (1-D) and
two-dimensional (2-D) DCT/IDCT computation. In section 2.2,
an overview of major one-dimensional DCT/IDCT
algorithms is presented. In section 2.3, the focus is on
two-dimensional DCT/IDCT methods and their
implementations. A summary is presented in section 2.4. Some typical
methods to demonstrate how the 1-D DCT or 2-D DCT
computation can be simplified are discussed. These methods
can also apply to the 1-D IDCT or 2-D IDCT computation
in general.

2.2 1-D DCT/IDCT Algorithms and Implementations

Given N data points x(0), x(1), . . . , x(N-1), the 1-D
N-point DCT and IDCT (or DCT-II and IDCT-II as defined by
Wang) are defined as [Wan84]:

$$X(k) = \sqrt{\frac{2}{N}}\, C(k) \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N} \tag{2.1}$$

$$x(n) = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} C(k)\, X(k) \cos\frac{(2n+1)k\pi}{2N} \tag{2.2}$$

where

$$C(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases}$$

for k=0, 1, . . . , N-1.

Intrinsically, for N-point data sequences, both the 1-D DCT
and 1-D IDCT require N^2 real multiplications and N(N-1)
real additions/subtractions. In order to reduce the number of
multiplications and additions/subtractions required, various
fast algorithms have been developed for computing the 1-D
DCT and 1-D IDCT. The development of efficient
algorithms for the computation of the DCT/IDCT began
immediately after Ahmed et al. reported their work on the DCT
[ANR74].

One initial approach for the computation of the DCT/IDCT is
via the Fourier Cosine Transform, and its relations to the
Discrete Fourier Transform (DFT) were exploited in the initial
developments of its computational algorithms. The approach
of computing the DCT/IDCT indirectly using the FFT has also
been borrowed by other researchers to obtain fast DCT/IDCT
algorithms via other kinds of discrete transforms (such as the
Walsh-Hadamard Transform, Discrete Hartley Transform,
etc.).

In addition, fast DCT/IDCT algorithms can also be
obtained by direct factorization of the DCT/IDCT coefficient
matrices. When the components of this factorization are
sparse, the decomposition represents a fast algorithm. Since
the factorization is not unique, there exist many different
forms of fast algorithms. The factorization schemes often
fall into the decimation-in-time (DIT) or the
decimation-in-frequency (DIF) category [RY90].

Furthermore, there also exist other approaches to develop
fast DCT/IDCT algorithms. The fast computation can be
obtained through recursive computation [WC95, AZK95],
planar rotations [LLM89], prime factor decomposition
[YN85], the filter-bank approach [Chi94] and the Z-transform
[SL96], etc.

2.2.1 Indirect 1-D DCT via Other Discrete Transforms

The Fourier Cosine Transform can be calculated using the
Fourier Transform of an even function. Since there exist many
Fast Fourier Transform (FFT) algorithms, it is natural to
first look at the existing FFT algorithms to compute the DCT.

Let x(0), x(1), . . . , x(N-1) be a given sequence. Then an
extended sequence {y(n)}, which is symmetric about the
(2N-1)/2 point, can be constructed as [RY90]:

$$y(n) = \begin{cases} x(n), & n = 0, 1, \ldots, N-1 \\ x(2N-n-1), & n = N, N+1, \ldots, 2N-1 \end{cases} \tag{2.3}$$

. . . as:

$$Z(k) = \sum \ldots, \quad k = 0, 1, \ldots, N-1 \tag{2.4}$$

The 2N-point sequence {y(n)} defined above can be used
to calculate the 2N-point DFT as:

$$Y(k) = \sum_{n=0}^{2N-1} y(n)\, W_{2N}^{nk}, \quad k = 0, 1, \ldots, 2N-1 \tag{2.5}$$

where W_{2N} denotes exp(-j2π/2N). The above formula can
be easily decomposed to

$$\begin{aligned} Y(k) &= \sum_{n=0}^{N-1} y(n)\, W_{2N}^{nk} + \sum_{n=N}^{2N-1} y(n)\, W_{2N}^{nk} \\ &= \sum_{n=0}^{N-1} x(n)\, W_{2N}^{nk} + \sum_{n=N}^{2N-1} x(2N-n-1)\, W_{2N}^{nk} \\ &= \sum_{n=0}^{N-1} x(n)\, W_{2N}^{nk} + \sum_{n=0}^{N-1} x(n)\, W_{2N}^{(2N-n-1)k} \\ &= 2\, W_{2N}^{-k/2} \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N} \end{aligned} \tag{2.6}$$

Multiplying both sides of Eq. (2.6) by a factor of

$$\frac{1}{2}\sqrt{\frac{2}{N}}\, C(k)\, W_{2N}^{k/2}$$

where C(k) is defined in Eq. (2.1) and (2.2), we directly
obtain the N-point DCT results as

$$X(k) = \sqrt{\frac{2}{N}}\, C(k) \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N} = \frac{1}{2}\sqrt{\frac{2}{N}}\, C(k)\, W_{2N}^{k/2}\, Y(k) \tag{2.7}$$

for k=0, 1, . . . , N-1. Thus, the N-point DCT X(k) can
easily be calculated from the 2N-point DFT Y(k) by
multiplying by the scale factor

$$\frac{1}{2}\sqrt{\frac{2}{N}}\, C(k)\, W_{2N}^{k/2}$$

When the sequence {x(n)} is real, {y(n)} is real and
symmetric. In this case, {Y(k)} can be obtained via two
N-point FFTs rather than by a single 2N-point FFT [Sor87,
RY90]. Since an N-point FFT requires N log_2 N complex
operations in general, the N-point DCT X(k) can be
computed with 2N log_2 N complex operations plus the scaling
with

$$\frac{1}{2}\sqrt{\frac{2}{N}}\, C(k)\, W_{2N}^{k/2}$$
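
The relationship between the N-point DCT and the 2N-point DFT can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and a naive DFT stands in for an FFT routine so that the block needs nothing beyond the standard library. It builds the symmetric extension y(n), takes its 2N-point DFT, and applies the scale factor (1/2) sqrt(2/N) C(k) W_{2N}^{k/2}.

```python
import cmath
import math


def dct_direct(x):
    """N-point DCT evaluated by brute force from Eq. (2.1)."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        c_k = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                  for n in range(n_pts))
        out.append(math.sqrt(2 / n_pts) * c_k * acc)
    return out


def dft(y):
    """Naive DFT; a real design would call an FFT here."""
    m = len(y)
    return [sum(y[n] * cmath.exp(-2j * math.pi * n * k / m) for n in range(m))
            for k in range(m)]


def dct_via_dft(x):
    """N-point DCT obtained from the 2N-point DFT of the mirrored input."""
    n_pts = len(x)
    y = list(x) + list(reversed(x))      # y(n) = x(2N-n-1) for n >= N
    big_y = dft(y)                       # 2N-point DFT Y(k)
    out = []
    for k in range(n_pts):
        c_k = 1 / math.sqrt(2) if k == 0 else 1.0
        w_half = cmath.exp(-1j * math.pi * k / (2 * n_pts))   # W_2N^(k/2)
        value = 0.5 * math.sqrt(2 / n_pts) * c_k * w_half * big_y[k]
        out.append(value.real)           # imaginary part is rounding noise
    return out
```

Only the first N of the 2N DFT outputs are used, which is one reason the indirect route is not automatically cheaper than a dedicated DCT algorithm.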

In the same spirit, the N-point DCT computation may also
be calculated via other transforms such as the Walsh-Hadamard
Transform (WHT) [Ven88] for N ≤ 16 and the Discrete Hartley
Transform (DHT) [Mal87]. The WHT is known to be fast
since the computation involves no multiplications. Thus an
algorithm for DCT via WHT may well utilize this advantage.
The detailed implementation of these two transforms can be
found in [RY90].

2.2.2 1-D DCT via Direct Factorizations

Consider the computation of the DCT of an input
sequence {x(n)}, and let this sequence be represented by an
(Nx1) column vector x. The transformed sequence (in
vector form) of the DCT computation can then be expressed in
vector notation as follows [RY90]:

$$\mathbf{X} = \begin{bmatrix} X(0) \\ \vdots \\ X(N-1) \end{bmatrix} = A_N\, \mathbf{x} \tag{2.8}$$

where A_N is an NxN coefficient matrix and each element of
A_N is

$$a_{ij} = C(i) \cos\frac{(2j+1)i\pi}{2N} \tag{2.9}$$

When the matrix A_N is factored into sparse matrices, the
number of computations is reduced.

One way to achieve a fast 1-D DCT computation by
sparse matrix factorizations is as follows. Assume N is a
power of 2; A_N can then be decomposed in the form

$$A_N = P_N \begin{bmatrix} A_{N/2} & 0 \\ 0 & R_{N/2} \end{bmatrix} B_N \tag{2.10}$$

where A_{N/2} is the coefficient matrix for an N/2-point DCT;
P_N is a permutation matrix which permutes the even rows in
increasing order in the top half and the odd rows in
decreasing order in the bottom half; B_N is a butterfly matrix
which can be expressed in terms of the identity matrix I_{N/2}
and the opposite identity matrix Ĩ_{N/2} (i.e. the elements on
the opposite diagonal are equal to 1, others are 0) as follows:

$$B_N = \begin{bmatrix} I_{N/2} & \tilde{I}_{N/2} \\ \tilde{I}_{N/2} & -I_{N/2} \end{bmatrix} \tag{2.11}$$

R_{N/2} is the remaining (N/2xN/2) block in the factor matrix,
which can be obtained by reversing the orders of both the
rows and columns of an intermediate matrix R̃_{N/2}, where the
definition of each element of R̃_{N/2} is:

$$\tilde{r}_{ij} = \cos\frac{(2i+1)(2j+1)\pi}{2N}, \quad i, j = 0, 1, \ldots, N/2-1 \tag{2.12}$$

The factorization of Eq. (2.10) is only partly recursive
because the matrix R_{N/2} cannot be recursively factored.
However, there is regularity in its factorization, where it can
be decomposed into five types of matrix factors [Wan83].
Only

$$N \log_2 N - \frac{3N}{2} + 4$$

real multiplications and

$$\frac{3N}{2}\left(\log_2 N - 1\right) + 2$$

real additions are required by this approach.

The key to this approach is that A_N is reduced in terms
of A_{N/2}. Taking a 4-point sequence as an example, the matrix
A_4 can be decomposed as:

$$A_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 \\ \cos(\pi/4) & \cos(3\pi/4) & 0 & 0 \\ 0 & 0 & -\cos(\pi/8) & \cos(3\pi/8) \\ 0 & 0 & \cos(3\pi/8) & \cos(\pi/8) \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & -1 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix} \tag{2.13}$$

where

$$P_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \quad B_4 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & -1 & 0 \\ 1 & 0 & 0 & -1 \end{bmatrix} \tag{2.14}$$

$$A_2 = \begin{bmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ \cos(\pi/4) & \cos(3\pi/4) \end{bmatrix}, \quad R_2 = \begin{bmatrix} -\cos(\pi/8) & \cos(3\pi/8) \\ \cos(3\pi/8) & \cos(\pi/8) \end{bmatrix}$$

Alternatively, some factorization schemes have adopted the
decimation-in-time (DIT) or decimation-in-frequency (DIF)
approach, which achieves fast computation through
rearranging the input sequence {x(n)} or output sequence {X(k)},
respectively.

Consider the DIT approach as an example. If the scale
factors in Eq. (2.1) are left out for convenience, the
transformed sequence X(k) can be expressed as:

$$X(k) = \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N}, \quad k = 0, 1, \ldots, N-1 \tag{2.15}$$

There are two steps in the DIT approach, and their
objective is to reduce an N-point DCT to an N/2-point DCT
by permutation of the input sample points in the time
domain. The first step in the DIT algorithm consists of a
rearrangement of the input sample points. The second step
reduces the N-point transform to two N/2-point transforms
to establish the recursive aspect of the algorithm [RY90].

Defining

$$G(k) = \sum_{n=0}^{N/2} \left[x(2n) + x(2n-1)\right] \cos\frac{2nk\pi}{N} \quad \text{and} \quad H(k) = \sum_{n=0}^{N/2-1} \left[x(2n) + x(2n+1)\right] \cos\frac{(2n+1)k\pi}{N}, \quad k = 0, 1, \ldots, N/2-1 \tag{2.16}$$

with x(-1)=x(N)=0 as the initial conditions for x(n).

Using the properties of the cosine function, it is easy to
see that Eq. (2.16) can be substituted into Eq. (2.15),
resulting in:

$$X(k) = \frac{1}{2}\, \frac{G(k) + H(k)}{\cos(k\pi/2N)} \quad \text{and} \quad X(N-k-1) = \frac{1}{2}\, \frac{G(k+1) - H(k+1)}{\cos\left[(N-k-1)\pi/2N\right]}, \quad k = 0, 1, \ldots, N/2-1 \tag{2.17}$$
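
One stage of the DIT split can be checked numerically against the unscaled DCT of Eq. (2.15). The Python sketch below is a plain verification aid under stated assumptions: the function names are hypothetical, G(k) and H(k) are evaluated by brute force rather than recursively, and the second line of Eq. (2.17) is taken with the sign G(k+1) - H(k+1), which is what the cosine symmetry requires for the H(k) defined in Eq. (2.16).

```python
import math


def sample(x, i):
    """x(i) with the boundary convention x(-1) = x(N) = 0."""
    return x[i] if 0 <= i < len(x) else 0.0


def dct_unscaled(x):
    """Unscaled N-point DCT of Eq. (2.15)."""
    n_pts = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                for n in range(n_pts))
            for k in range(n_pts)]


def g_term(x, k):
    """G(k) of Eq. (2.16): an (N/2 + 1)-point cosine sum."""
    n_pts = len(x)
    return sum((sample(x, 2 * n) + sample(x, 2 * n - 1))
               * math.cos(2 * n * k * math.pi / n_pts)
               for n in range(n_pts // 2 + 1))


def h_term(x, k):
    """H(k) of Eq. (2.16): an N/2-point DCT of pairwise sums."""
    n_pts = len(x)
    return sum((sample(x, 2 * n) + sample(x, 2 * n + 1))
               * math.cos((2 * n + 1) * k * math.pi / n_pts)
               for n in range(n_pts // 2))


def dct_dit_one_stage(x):
    """Recombine G and H into X(k) following Eq. (2.17); N must be even."""
    n_pts = len(x)
    out = [0.0] * n_pts
    for k in range(n_pts // 2):
        out[k] = (g_term(x, k) + h_term(x, k)) / (
            2 * math.cos(k * math.pi / (2 * n_pts)))
        out[n_pts - k - 1] = (g_term(x, k + 1) - h_term(x, k + 1)) / (
            2 * math.cos((n_pts - k - 1) * math.pi / (2 * n_pts)))
    return out
```

In a real DIT algorithm G and H would themselves be computed by smaller fast transforms; here they are summed directly so that only the recombination step is being exercised.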



In Eq. (2.16), the sequence {H(k)} is obtained as a DCT
with N/2 sample points and {G(k)} is obtained as a DCT
with (N/2+1) sample points. Each of these smaller transforms
can be further reduced, which leads to the desired recursive
structure. Excluding scaling and normalization, it is found
that for an N-point (N being a power of 2) sequence {x(n)},
the DIT algorithm for DCT requires ((N/2)log_2 N + N/4) real
multiplications and ((3N/2-1)log_2 N + N/4 + 1) real additions
[RY90].

When the rearrangement of the sample points results in
the transformed sequence being grouped into even- and
odd-frequency indexed portions, the decomposition is said
to constitute a DIF algorithm. For an N-point (radix-2)
sequence {x(n)}, the DIF algorithm for DCT requires
(N/2)log_2 N real multiplications and ((3N/2)log_2 N - N + 1)
real additions [RY90].

2.2.3 1-D DCT Based on Recursive Algorithms

In addition to the algorithms described in the previous
sections, there exist many other kinds of approaches. Using
well-known recursive algorithms to compute the DCT/IDCT
is one of them, and it can achieve the goal of fast
computation. Two typical ones are shown here: the Chebyshev
polynomial recurrence and Clenshaw's recurrence formula.

One fast recursive algorithm for computing the DCT
based on the Chebyshev polynomial factorization is
proposed by Wang and Chen [WC95]. Recall the following
trigonometric identity:

$$\cos(\gamma\alpha) = 2\cos\alpha\, \cos[(\gamma-1)\alpha] - \cos[(\gamma-2)\alpha] \tag{2.18}$$

which, by the way, can be proved using the Chebyshev
polynomial. If one leaves out the scale factors in Eq. (2.1)
for convenience (i.e. uses Eq. (2.15) as the definition of the
1-D DCT) and defines the recursive variables as

$$P(k,n) = \cos\frac{(2n+1)k\pi}{2N} = \cos\frac{(2n+1)\alpha}{2}, \quad \alpha = k\pi/N \quad \text{and} \quad A(k,n) = \sum_{m=0}^{n} x(m) \cos\frac{(2m+1)k\pi}{2N} \tag{2.19}$$

the 1-D DCT can be computed using the following
Chebyshev polynomial recurrence:

$$P(k,-1) = P(k,0) = \cos(\alpha/2) \quad \text{and} \quad P(k,n+1) = 2\cos(\alpha)\, P(k,n) - P(k,n-1) \tag{2.20}$$

$$A(k,n) = A(k,n-1) + x(n)\, P(k,n) \tag{2.21}$$

where X(k)=A(k,N-1), k=0, 1, . . . , N-1. Thus the X(k) can
be calculated in N recursive steps from the input sequence
x(n) using Eq. (2.20) and (2.21). For an N-point sequence
{x(n)}, this recursive algorithm requires 2N(N-1) real
multiplications and real additions.

In addition, Aburdene et al. proposed another fast
recursive algorithm for computing the DCT based on
Clenshaw's (or Goertzel's, as it is called in other papers)
recurrence formula [AZK95]. Clenshaw's recurrence formula
states that, considering a linear combination of the form

$$f(x) = \sum_{n=0}^{N-1} c(n)\, F(x,n) \tag{2.22}$$

in which F(x,n) obeys a recurrence relation

$$F(x,n+1) = \alpha(x,n)\, F(x,n) + \beta(x,n)\, F(x,n-1) \tag{2.23}$$

for some functions α(x,n) and β(x,n), then the sum f(x) can
be computed as

(2.24)

where {ψ(n)} can be obtained from the following recurrence
relations:

$$\psi(-2) = \psi(-1) = 0 \quad \text{and} \quad \psi(n) = \alpha(x,n)\, \psi(n-1) + \beta(x,n)\, \psi(n-2) + c(n), \quad n = 0, 1, \ldots, N-1 \tag{2.25}$$

Defining

$$\lambda_k = k\pi/N \quad \text{and} \quad F(\lambda_k, n) = \cos\left[(n+1/2)\lambda_k\right] \tag{2.26}$$

the 1-D DCT can be expressed as

$$f(\lambda_k) = X(k) = \sum_{n=0}^{N-1} x(n)\, F(\lambda_k, n), \quad k = 0, 1, \ldots, N-1 \tag{2.27}$$

The calculation of F(λ_k,n) can be made recursively using
the identity

$$\cos\left[(n+3/2)\lambda_k\right] = 2\cos(\lambda_k) \cos\left[(n+1/2)\lambda_k\right] - \cos\left[(n-1/2)\lambda_k\right] \tag{2.28}$$

to generate the recurrence expression for F(λ_k, n+1) as

$$F(\lambda_k, n+1) = 2\cos(\lambda_k)\, F(\lambda_k, n) - F(\lambda_k, n-1) \tag{2.29}$$

Comparing Eq. (2.23) and (2.29), one can see that the
terms α(x,n) and β(x,n) in Eq. (2.23) should be chosen as
2cos(λ_k) and -1, respectively.

Substituting Eq. (2.24) in Eq. (2.27), we find

$$\begin{aligned} X(k) &= x(N-1)\, F(\lambda_k, N-1) + F(\lambda_k, N-2)\, \psi(N-2) - F(\lambda_k, N-1)\, \psi(N-3) \\ &= (-1)^k \left[ x(N-1)\cos(\lambda_k/2) + \cos(3\lambda_k/2)\, \psi(N-2) - \cos(\lambda_k/2)\, \psi(N-3) \right] \\ &= (-1)^k \cos(\lambda_k/2) \left[ \psi(N-1) - \psi(N-2) \right] \end{aligned} \tag{2.30}$$

where ψ(n) is obtained from Eq. (2.25) as

$$\psi(-2) = \psi(-1) = 0 \quad \text{and} \quad \psi(n) = 2\cos(\lambda_k)\, \psi(n-1) - \psi(n-2) + x(n), \quad n = 0, 1, \ldots, N-1 \tag{2.31}$$

Thus, ψ(n) can be recursively generated from the input
sequence x(n) according to Eq. (2.31). At the Nth step,
X(k) can be evaluated by Eq. (2.30) for k=0, 1, . . . , N-1.
For an N-point sequence {x(n)}, this recursive algorithm
requires about N^2 real multiplications and real additions.
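
Both recursions can be written in a few lines of software. The Python sketch below is illustrative and uses hypothetical function names: `dct_chebyshev` follows Eq. (2.20) and (2.21), while `dct_clenshaw` follows Eq. (2.31) and evaluates X(k) from the last line of Eq. (2.30). Both compute the unscaled DCT of Eq. (2.15).

```python
import math


def dct_unscaled(x):
    """Brute-force unscaled DCT of Eq. (2.15), used as the reference."""
    n_pts = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                for n in range(n_pts))
            for k in range(n_pts)]


def dct_chebyshev(x):
    """Chebyshev polynomial recurrence, Eq. (2.20) and (2.21)."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        alpha = k * math.pi / n_pts
        two_cos = 2 * math.cos(alpha)
        p_prev = math.cos(alpha / 2)     # P(k, -1)
        p_cur = math.cos(alpha / 2)      # P(k, 0)
        acc = 0.0                        # A(k, -1)
        for n in range(n_pts):
            acc += x[n] * p_cur                                # Eq. (2.21)
            p_prev, p_cur = p_cur, two_cos * p_cur - p_prev    # Eq. (2.20)
        out.append(acc)                  # X(k) = A(k, N-1)
    return out


def dct_clenshaw(x):
    """Clenshaw-type recurrence, Eq. (2.31), finished with Eq. (2.30)."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        lam = k * math.pi / n_pts
        two_cos = 2 * math.cos(lam)
        psi_1 = 0.0                      # psi(n-1), starts as psi(-1)
        psi_2 = 0.0                      # psi(n-2), starts as psi(-2)
        for n in range(n_pts):
            psi = two_cos * psi_1 - psi_2 + x[n]
            psi_2, psi_1 = psi_1, psi
        # After the loop psi_1 = psi(N-1) and psi_2 = psi(N-2).
        out.append((-1) ** k * math.cos(lam / 2) * (psi_1 - psi_2))
    return out
```

Each output X(k) depends only on its own recurrence state, which is why these schemes map naturally onto N identical processing cells.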
2.2.4 1-D DCT/IDCT Hardware Implementations 

The algorithms that compute the DCT/IDCT indirectly
via other discrete transforms are normally not good
candidates for hardware implementation. The conversion
between the input and output data of two different
transforms is generally complicated. Many transforms, like the
FFT and WHT, use complex architectures, which make the
hardware implementations of the 1-D DCT even less
efficient. The advantage of computing the 1-D DCT via the DFT is
that standard FFT routines and implementations are
available that can be directly used in the DCT/IDCT.

TABLE 2.1

Summary of some 1-D DCT algorithms

Algorithm   1-D DCT via      Number of               Arithmetic   Interconnection
                             Multiplications         Types        Complexity

[Har76]     DFT/FFT                                  Complex      High
[Wan83]     Direct Factor.   N log_2 N - 3N/2 + 4    Real         Very High
[RY90]      Factor./DIT      (N/2) log_2 N + N/4     Real         High
[WC95]      Recursive        2N(N - 1)               Real         Low
[AZK95]     Recursive        N^2                     Real         Low

The algorithms that compute the DCT/IDCT via direct
factorizations have the advantages that they are reasonably
fast and recursive to some degree. These algorithms make
full use of the sparseness of the DCT/IDCT coefficient
matrix and require far fewer multiplications and
additions/subtractions. But the complicated index mapping
and global interconnection from the input to the output
data make the hardware implementations rather difficult.

Alternatively, although the DCT/IDCT algorithms based
on recursive approaches do not necessarily use fewer
operations than other discrete transforms, their recursive nature
makes them easy to implement with relatively simple
processing elements (PEs) and simple interconnections
among the PEs. Identical or similarly structured PEs in a
hardware implementation can greatly reduce the cost of the
design and layout process. It has been shown that time
recursive algorithms and the resulting DCT/IDCT
architectures are well suited for VLSI implementation.

One of the recursive schemes that can be easily adopted
for the 1-D DCT hardware implementation is the Chebyshev
polynomial method (described in section 2.2.3). The basic
function cell to compute the 1-D DCT based on this method
is shown in FIG. 6 [WC95]. For an N-point input sequence,
a total of N cells are required for k=0, 1, . . . , N-1. Since
these N cells have identical structure, functional design and
layout cost can be reduced correspondingly.

Another example of the 1-D DCT hardware
implementation using a recursive scheme is based on Clenshaw's
recurrence formula (described in section 2.2.3). The hardware
structure of the implementation is shown in FIG. 7
[AZK95].

2.3 2-D DCT/IDCT Algorithms and 
Implementations 

Similar to the definitions of the 1-D DCT/IDCT, the
forward and inverse 2-D Discrete Cosine Transform (2-D
DCT/IDCT) of an input sequence x(m,n), 0≤m,n<N, are
defined as:

$$X(k,l) = \frac{2}{N}\, C(k)\, C(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2m+1)k\pi}{2N} \cos\frac{(2n+1)l\pi}{2N} \tag{2.32}$$

$$x(m,n) = \frac{2}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} C(k)\, C(l)\, X(k,l) \cos\frac{(2m+1)k\pi}{2N} \cos\frac{(2n+1)l\pi}{2N} \tag{2.33}$$

where

$$C(k),\, C(l) = \begin{cases} 1/\sqrt{2}, & k, l = 0 \\ 1, & \text{otherwise} \end{cases}$$


For an NxN point input sequence, both the 2-D DCT and
2-D IDCT require O(N^4) real multiplications and
corresponding additions/subtractions, assuming the computations
are carried out by brute force. In order to improve the
efficiency of 2-D DCT and 2-D IDCT computations, various
fast computational algorithms and corresponding
architectures have been proposed. In general, all of these algorithms
can be broadly classified into 3 basic categories: 1) compute
the 2-D DCT/IDCT indirectly via other discrete fast
transforms, 2) decompose the 2-D DCT/IDCT into two 1-D
DCT/IDCTs, and 3) compute the 2-D DCT/IDCT based on
direct matrix factorization or decomposition.

Computation of the 2-D DCT/IDCT via other discrete fast
transforms takes advantage of the existence of
other kinds of 2-D discrete transform algorithms and
architectures. The best candidates that can be employed to perform
the 2-D DCT/IDCT, for example, are the 2-D FFT and 2-D
WHT [NK83, Vet85].

By contrast, the decomposition of a 2-D DCT/IDCT into
two 1-D DCT/IDCTs, which conventionally is also called
the Row-Column Method (RCM), evaluates the 1-D DCT/
IDCT in row-column-wise or column-row-wise form. That
is, it starts by processing the row (or column) elements of the
input data block as a 1-D DCT/IDCT and stores the results in
an intermediate memory; it then processes the transposed
column (or row) elements of the intermediate results to
yield the 2-D DCT/IDCT results [CW95, SL96,
MW95, Jan94]. Since the RCM reduces the 2-D DCT into
two separate 1-D DCTs, existing 1-D algorithms listed in
section 2.2 can be directly used so that the computational
complexity can be simplified.
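
The row-column idea can be sketched in software as follows. This Python block is an illustrative check, not a hardware description: the function names are hypothetical, the 1-D transform is the brute-force DCT of Eq. (2.1), and the intermediate memory with transposition is modeled by a plain list transpose. The result is compared with a direct evaluation of Eq. (2.32).

```python
import math


def dct_1d(x):
    """Scaled N-point 1-D DCT of Eq. (2.1), brute force."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        c_k = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
                  for n in range(n_pts))
        out.append(math.sqrt(2 / n_pts) * c_k * acc)
    return out


def transpose(block):
    """Stand-in for the intermediate transposition memory."""
    return [list(col) for col in zip(*block)]


def dct_2d_row_column(block):
    """2-D DCT as N row transforms followed by N column transforms."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


def dct_2d_direct(block):
    """Brute-force 2-D DCT of Eq. (2.32), used as the reference."""
    n_pts = len(block)

    def c(i):
        return 1 / math.sqrt(2) if i == 0 else 1.0

    out = [[0.0] * n_pts for _ in range(n_pts)]
    for k in range(n_pts):
        for l in range(n_pts):
            acc = 0.0
            for m in range(n_pts):
                for n in range(n_pts):
                    acc += (block[m][n]
                            * math.cos((2 * m + 1) * k * math.pi / (2 * n_pts))
                            * math.cos((2 * n + 1) * l * math.pi / (2 * n_pts)))
            out[k][l] = (2 / n_pts) * c(k) * c(l) * acc
    return out
```

The row-column version performs 2N one-dimensional transforms of length N, whereas the direct version performs N^2 inner products of length N^2.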

The direct 2-D factorization methods work directly on the
2-D data set and coefficient matrices. This kind of approach
mainly concentrates on reducing the redundancy within the
2-D DCT/IDCT computations so that far fewer
multiplications are required [DG90, CL91, Lee97].

2.3.1 2-D DCT via Other Discrete Transforms

The close relationship between the DCT and the DFT can
also be exploited in the two-dimensional case.

As shown by Nasrabadi and King [NK83], a
rearrangement of the input matrix elements easily leads to expressions
involving evaluation of two-dimensional DFTs. Leaving the
scale factors out of Eq. (2.32) and Eq. (2.33), and treating
x(m,n) and X(k,l) as scaled and normalized two-dimensional
input and output data, gives

$$X(k,l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2m+1)k\pi}{2N} \cos\frac{(2n+1)l\pi}{2N} \tag{2.34}$$

$$x(m,n) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} X(k,l) \cos\frac{(2m+1)k\pi}{2N} \cos\frac{(2n+1)l\pi}{2N} \tag{2.35}$$



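Define an intermediate NxN transform sequence

$$\begin{aligned} y(m,n) &= x(2m,\, 2n), & m, n &= 0, 1, \ldots, N/2-1 \\ y(m,n) &= x(2N-2m-1,\, 2n), & m &= N/2, \ldots, N-1,\ n = 0, 1, \ldots, N/2-1 \\ y(m,n) &= x(2m,\, 2N-2n-1), & m &= 0, 1, \ldots, N/2-1,\ n = N/2, \ldots, N-1 \\ y(m,n) &= x(2N-2m-1,\, 2N-2n-1), & m, n &= N/2, \ldots, N-1 \end{aligned} \tag{2.36}$$

where the 2-D DFT of y(m,n) can be calculated as:

$$Y(k,l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} y(m,n)\, W_N^{mk}\, W_N^{nl} \tag{2.37}$$

and W_N^k denotes exp(-j2kπ/N).

Furthermore, using a simple compound angle formula for
the cosine functions, it is possible to derive the following
similarly to Eq. (2.7) as

$$X(k,l) = \frac{1}{2}\, \mathrm{Re}\left\{ W_{4N}^{k} \left[ W_{4N}^{l}\, Y(k,l) + W_{4N}^{-l}\, Y(k, N-l) \right] \right\} \tag{2.38}$$

Eq. (2.38) is sometimes referred to as representing
"phasor-modified" DFT components. It can be further
simplified as

$$X(k,l) = \frac{1}{2}\, \mathrm{Re}\left\{ A(k,l) + jA(k, N-l) \right\} \tag{2.39}$$

where

$$A(k,l) = W_{4N}^{k}\, W_{4N}^{l}\, Y(k,l) \tag{2.40}$$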



Since Y(k,l) is the 2-D DFT, its implementation can be
realized by using any of the available 2-D algorithms. One
of the most efficient methods, proposed by Nussbaumer, is to
compute the 2-D real DFT by means of polynomial
transforms [Nus81]. The reduction in computational
complexity is obtained by mapping the DFT on the index m to a
polynomial transform. Overall, an NxN point DCT is
mapped onto N DFTs of length N. For a real NxN input
sequence {x(m,n)}, the 2-D DCT requires ((N^2/2-1)log_2 N
+ N^2/3 - 2N - 8/3) complex multiplications and
((5N^2/2)log_2 N + N^2/3 - 6N - 62/3) complex additions.
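
The mapping from the unscaled 2-D DCT to a 2-D DFT of reordered data can be checked numerically. The Python sketch below is illustrative and rests on stated assumptions: the function names are hypothetical, a naive 2-D DFT stands in for a fast algorithm, N is assumed even, and the recombination uses the "phasor-modified" form X(k,l) = (1/2) Re{A(k,l) + jA(k,N-l)} with A(k,l) = W_{4N}^k W_{4N}^l Y(k,l) and Y taken periodically in l.

```python
import cmath
import math


def dct2_unscaled(x):
    """Brute-force unscaled 2-D DCT of Eq. (2.34)."""
    n = len(x)
    return [[sum(x[m][q]
                 * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                 * math.cos((2 * q + 1) * l * math.pi / (2 * n))
                 for m in range(n) for q in range(n))
             for l in range(n)]
            for k in range(n)]


def reorder(x):
    """Input reordering: even samples ascending, odd samples descending."""
    n = len(x)

    def src(i):
        return 2 * i if i < n // 2 else 2 * n - 2 * i - 1

    return [[x[src(m)][src(q)] for q in range(n)] for m in range(n)]


def dft2(y):
    """Naive 2-D DFT; a fast 2-D algorithm would be used in practice."""
    n = len(y)
    w = [cmath.exp(-2j * math.pi * t / n) for t in range(n)]
    return [[sum(y[m][q] * w[(m * k) % n] * w[(q * l) % n]
                 for m in range(n) for q in range(n))
             for l in range(n)]
            for k in range(n)]


def dct2_via_dft(x):
    """2-D DCT recovered from the 2-D DFT of the reordered input."""
    n = len(x)
    big_y = dft2(reorder(x))

    def w4(t):
        return cmath.exp(-2j * math.pi * t / (4 * n))   # W_4N^t

    def a(k, l):
        return w4(k) * w4(l) * big_y[k % n][l % n]       # Y is periodic in l

    return [[0.5 * (a(k, l) + 1j * a(k, n - l)).real for l in range(n)]
            for k in range(n)]
```

The complex arithmetic and the index reordering visible here are the costs that make this route less attractive for hardware than for software.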

Besides, the 2-D DCT can also be carried out via the 2-D
Walsh-Hadamard Transform (WHT) [Vet85].

2.3.2 2-D DCT by Row-Column Method (RCM)

Like some other discrete transforms, such as the DFT, WHT,
ST, HT, etc., the 2-D DCT is a separable transform, and Eq.
(2.32) can also be expressed as

$$X(k,l) = \sqrt{\frac{2}{N}}\, C(k) \sum_{m=0}^{N-1} \left[ \sqrt{\frac{2}{N}}\, C(l) \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2n+1)l\pi}{2N} \right] \cos\frac{(2m+1)k\pi}{2N} \tag{2.41}$$

where k, l=0, 1, . . . , N-1.

The inner summation

$$\sqrt{\frac{2}{N}}\, C(l) \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2n+1)l\pi}{2N}$$

is an N-point 1-D DCT of the rows of x(m,n), whereas the
outer summation represents the N-point 1-D DCT of the
columns of the "semi-transformed" matrix, whose
elements are

$$\sqrt{\frac{2}{N}}\, C(l) \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2n+1)l\pi}{2N}$$

where m, l=0, 1, . . . , N-1.

This implies that a 2-D NxN DCT can be implemented by
N N-point DCTs along the columns of x(m,n), followed by
N N-point DCTs along the rows of the results after the
column transformations. The order in which the
row transform and the column transform are done is
immaterial in theory.

All 1-D DCT fast algorithms discussed in section 2.2 can
be used here to simplify the 2-D DCT computation, which
requires a total of 2N 1-D DCTs. For example, if the 1-D
DCT is carried out via the 1-D FFT, approximately
2N x (2N log_2 N) complex operations plus the scaling are
required.

2.3.3 2-D DCT Based on Direct Matrix Factorization/
Decomposition

In the RCM, the computation reduction applies only to
one 1-D array at a time. That makes these algorithms less
efficient and not quite modular in structure. Haque reported
a 2-D fast DCT algorithm based on a rearrangement of the
elements of the two-dimensional input matrix into a block
matrix form [Haq85]. Each block of the matrix is then
calculated via a "half-size" 2-D DCT.

The NxN DCT block decomposition of Eq. (2.34) is based
upon the following procedures:

(1) Decompose the NxN input data x(m,n) into four
(N/2)x(N/2) sub-blocks:

$$p(m,n) = x(m,n), \quad q(m,n) = x(m,\, N-n-1), \quad r(m,n) = x(N-m-1,\, n), \quad s(m,n) = x(N-m-1,\, N-n-1) \tag{2.42}$$

(2) Arrange the sub-blocks as a block column vector and
multiply by a 4x4 block Walsh-Hadamard matrix H to
get four (N/2)x(N/2) sub-blocks:

$$\begin{bmatrix} [g1] \\ [g2] \\ [g3] \\ [g4] \end{bmatrix} = \begin{bmatrix} I & I & I & I \\ I & -I & I & -I \\ I & I & -I & -I \\ I & -I & -I & I \end{bmatrix} \begin{bmatrix} [p(m,n)] \\ [q(m,n)] \\ [r(m,n)] \\ [s(m,n)] \end{bmatrix} \tag{2.43}$$

(3) Scale these sub-blocks with a diagonal scale matrix W:

(2.44)

where W=(1/2) diag [1/cos(π/2N), 1/cos(3π/2N), . . . ,
1/cos((N-1)π/2N)].

(4) Compute the four (N/2)x(N/2) 2-D DCTs of the scaled
sub-blocks [g1], [g2], [g3] and [g4] to obtain the results
[G1(k,l)], [G2(k,l)], [G3(k,l)] and [G4(k,l)].

(5) Denote the even-even, even-odd, odd-even and
odd-odd elements of the transformed matrix [X(k,l)] in Eq.
(2.35) by four (N/2)x(N/2) sub-blocks:

$$P(k,l) = X(2k,\, 2l), \quad Q(k,l) = X(2k,\, 2l+1), \quad R(k,l) = X(2k+1,\, 2l), \quad S(k,l) = X(2k+1,\, 2l+1) \tag{2.45}$$

(6) Convert [G1(k,l)], [G2(k,l)], [G3(k,l)] and [G4(k,l)] to
[P(k,l)], [Q(k,l)], [R(k,l)] and [S(k,l)] by:

(2.46)

where L is an (N/2)x(N/2) lower triangular matrix of
the form

$$L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & 0 \\ \cdots & & & -1 & 1 \end{bmatrix} \tag{2.47}$$

The computation of the 2-D DCT based on Haque's
algorithm requires ((3/4)N^2 log_2 N) multiplications and
(3N^2 log_2 N - 2N^2 + 2N) additions [Haq85, RY90].

As an alternative, Cho and Lee proposed another
approach for decomposing a 2-D DCT [CL91]. Using the
following trigonometric relation

$$\cos\frac{(2m+1)k\pi}{2N} \cos\frac{(2n+1)l\pi}{2N} = \frac{1}{2}\left[ \cos\frac{(2m+1)k\pi + (2n+1)l\pi}{2N} + \cos\frac{(2m+1)k\pi - (2n+1)l\pi}{2N} \right] \tag{2.48}$$

the 2-D DCT in Eq. (2.34) can be rewritten as

$$X(k,l) = \frac{1}{2}\left[ A(k,l) + B(k,l) \right] \tag{2.49}$$

where

$$A(k,l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2m+1)k\pi + (2n+1)l\pi}{2N} \quad \text{and} \quad B(k,l) = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\frac{(2m+1)k\pi - (2n+1)l\pi}{2N} \tag{2.50}$$

After some complicated data reordering and
manipulations, Cho and Lee have shown that A(k,l) and
B(k,l) can be expressed in terms of N 1-D DCTs so that an
NxN DCT can be obtained from N separate 1-D DCTs
[CL91].

2.3.4 2-D DCT/IDCT Hardware Implementations

A lot of papers have been written lately on the
development of VLSI and chip implementation of the 2-D DCT/
IDCT. Most of the works have concentrated on the basic
block size of 8x8, because the 8x8 block has been found to be
able to provide sufficient details and localized activities of
the image such that it has been adopted as the standard 2-D
DCT/IDCT size in almost all existing image and video
processing and compression protocols.

TABLE 2.2

Algorithm   2-D DCT via                Number of                  Arithmetic   Interconnection
                                       Multiplications            Types        Complexity

[Nus81]     DFT/FFT                    (N^2/2 - 1) log_2 N +      Complex      High
                                       N^2/3 - 2N - 8/3
[WC95]      RCM/Recursive              4N^2(N - 1)                Real         Low
[CSF77]     RCM/Factorization          N^3                        Real         Medium
[CL91]      Direct 2-D Factorization   N^3                        Real         Very High
[Haq85]     Direct 2-D Factorization   (3/4) N^2 log_2 N          Real         Very High
Because of the limitation of areas and interconnections in
VLSI implementation, not much of the chip development
work has included the mapping of fast, two-dimensional
algorithms onto silicon directly. Instead, regularity of design
and feasibility of layout seem to be the primary concern,
together with a realistic throughput rate for real-time
applications. However, there have been attempts to map Lee's
algorithm [Lee84] onto silicon [RY90]. Chips based
on a single processor rotation [LLM89] have also been
reported [RY90]. But all of them are limited to 1-D DCT/
IDCT applications.

In practice, the 2-D DCT algorithms based on other
discrete transforms suffer the same setbacks as their 1-D
counterparts: complex arithmetic operations, complicated
conversion between the two different transforms and
complex index mapping, which make the hardware
implementations via other discrete transforms rather difficult.

Generally speaking, the 2-D DCT algorithms based on
direct matrix factorization or decomposition are much more
suitable for software implementation, because they usually
require fewer multiplications than other approaches and the
complex index mapping involved is not a problem for
software. The high communication complexity and global
interconnection involved in these algorithms make them
difficult to implement using VLSI technology.

The 2-D DCT algorithms based on the RCM approach,
however, can be realized using a very simple and regular
structure, since the RCM reduces the 2-D DCT into two
stages of 1-D DCTs and the existing 1-D DCT algorithms
listed in section 2.2 can be employed directly. The relatively
simple localized interconnections of the RCM are another key
feature making it suitable for VLSI implementation. The
block diagram of the "row-column" transform approach to
realize an NxN 2-D DCT is illustrated in FIG. 8.

Needless to say, variations on this basic block structure
are many. Some use special devices for the intermediate
memory transposition operation. Some use a single 1-D
DCT processor to perform both row and column
transformations one-by-one in order to reduce the die size [MW95].
Others use time-recursive algorithms and architectures to
achieve a regular and modular structure [SL96, WC95]. Some
proposed systolic array architectures for the RCM can even
avoid using intermediate matrix transposition circuitry, at the
extra expense of data synchronization and input sequence
reordering [CW95].

It is worth noting that all the chip developments have one
common ground. Almost all the 2-D DCT or IDCT
processors developed so far have made use of the separability
property of the 2-D DCT or IDCT by decomposing it into
two separate 1-D transforms. None have attempted to
directly map a specific 2-D DCT or IDCT algorithm to
silicon.

2.4 Summary

In this section, various conventional approaches for
computing the 1-D DCT have been examined, and some
of the algorithms designed to implement the 2-D DCT have
also been investigated. The 1-D algorithms can be loosely
classified as the DCT via other transforms, via sparse matrix
factorization and via time-recursive approaches. Similarly, the
2-D algorithms can be classified as DCT via other
transforms, via direct matrix factorization/decomposition
and via Row-Column methods. Both the 1-D IDCT and 2-D
IDCT can be computed and implemented with approaches
similar to the 1-D DCT and 2-D DCT.

The most prominent property is the separability property
of the 2-D DCT or IDCT, which has been exploited both in
the algorithms and in the chip designs. Almost all existing
2-D DCT or IDCT processors are based on the reduction of
the 2-D DCT or IDCT to lexicographically ordered 1-D
transforms (i.e. RCM).

Compared with the Row-Column methods, the direct 2-D
DCT or IDCT matrix factorization/decomposition is more
computation efficient and generally requires fewer
multiplications. But the complex global communication
interconnection of existing direct 2-D DCT or IDCT algorithms has
prevented them from being implemented in VLSI chips due
to design and layout concerns.

FIG. 3 is a diagram of a zig-zag scanning order of DCT
scanning coefficients.

FIG. 4 is a diagram of an example simplified MPEG video
encoding process.

FIG. 5 is a diagram of an example simplified MPEG video
decoding process.

FIG. 6 is a block diagram of an example processing
element cell k for a Chebyshev polynomial recurrence.

FIG. 7 is a block diagram of a recursive implementation
of 1-D DCT based on Clenshaw's formula.

FIG. 8 is a block diagram of a row-column approach for
performing a 2-D DCT.

FIG. 9 shows five graphs of an example 2-D DCT
simulation with finite coefficient word lengths according to
the present invention.

FIG. 10 shows five graphs of an example 2-D IDCT
simulation with finite coefficient word lengths according to
the present invention.

FIG. 11 shows five graphs of an example 2-D DCT
simulation with finite truncation lengths according to the
present invention.

FIG. 12 shows five graphs of an example 2-D IDCT
simulation with finite truncation lengths according to the
present invention.

FIG. 13 is a block diagram of DCT data flow according to
an embodiment of the present invention.

FIG. 14 is a block diagram of IDCT data flow according
to an embodiment of the present invention.

FIG. 15 is a diagram of an example shuffler data structure
according to the present invention.

FIG. 16 is a diagram of an example EE-sub-block
according to the present invention.

FIG. 17 is a diagram of example latching, multiplexing, 

SUMMARY OF THE INVENTION 35 and clipping stages for respective 2-D DCT of EE, EO, OE, 

Hie present invention provides a method and system for and 00 sub-blocks according to an embodiment of the 

computing 2-D DCT/IDCT which is easy to implement with P resent inventloa - 

VLSI technology to achieve high throughput to meet the FIG - 18 15 a diagram of an example architecture and data 

requirements of high definition video processing in real for a 2 " D Dcr according to an embodiment of the present 

time. 40 invention. 

The present invention is based on a direct 2-D matrix FIG - 19 fc a dia S ram of ™ exam P le architecture and data 

factorization approach. The present invention computes the for a 2 ; D IDCT according to an embodiment of the present 

8x8 DCT/IDCT through four 4x4 matrix multiplication invention. 

sub-blocks. Each sub-block is half the size of the original 4S FIG. 20 is a diagram of an example combined architecture 

8x8 size and therefore requires a much lower number of and data for 2 " D DCT and 2 " D rocT according to an 

multipUcations. Additionally, each sub-block can be imple- embodiment of the present invention, 

mented independently with localized interconnection so that FIG. 21 is a flowchart of a synthesis approach according 

parallelism can be exploited and a much higher DCT/IDCT to an embodiment of the present invention, 

throughput can be achieved. SQ FIG. 22 is a timing diagram illustrating waveforms of 2-D 

Further embodiments, features, and advantages of the DCT input/output and handshaking signals for an example 

present inventions, as well as the structure and operation of VLSI implementation according to the present invention, 

the various embodiments of the present invention, are FIG. 23 is a timing diagram illustrating waveforms of 2-D 

described in detail below with reference to the accompany- IDCT input/output and handshaking signals for an example 

ing drawings. 55 VLSI implementation according to the present invention. 

The present invention will now be described with refer- 
ence to the accompanying drawings. In the drawings, like 

The accompanying drawings, which are incorporated reference numbers indicate identical or functionally similar 

herein and form a part of the specification, illustrate the elements. Additionally, the left-most digitus) of a reference 

present invention and, together with the description, further 60 number identifies the drawing in which the reference num- 

serve to explain the principles of the invention and to enable ber first appears. 

a person skilled in the pertinent art to make and use the DETAILED DESCRIPTION OF THE 

invention. PREFERRED EMBODIMENTS 

FIG. 1 is a diagram of a typical MPEG video pictures 

display order. 65 Overview 

FIG. 2 is a diagram of a typical MPEG video pictures As recognized by the inventor, given that the MPEG 

coding order. encoding/decoding process can be decomposed into parallel 
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rt .and^serialoperations,. itseems^naturaLto ,use<som& kind^of « 
hybrid scheme to implement all the functions in the 
MPEG-2 encoding/decoding process. One hybrid scheme 
approach that combines a specially designed hardware com- 
ponent with an ordinary DSP chip is: 

(1) Use a specially designed ASIC with a parallel struc- 
ture to implement the 2-D DCT/IDCT and the motion 
estimation; and 

(2) Use an inexpensive DSP to implement the serial 
operations and provide the control structure to the 
ASIC. 

With this hybrid approach, the combined system not only 
can take advantage of the powerful parallel processing 
abilities of the hardware components, but also possesses the 
flexibility of software programming to cope with different 
encoding/decoding parameters required. The hybrid scheme, 
which includes dedicated 2-D DCT/IDCT and motion esti- 
mation (for encoder only) ASIC components plus a serial 
structured DSP chip, might be the best feasible architecture 
to meet the requirements of MPEG-2 real-time encoding/ 
decoding. 

Since the 2-D DCT/IDCT computation is the fundamental
element of MPEG video encoding and decoding, the
development of a 2-D DCT/IDCT hardware module is a high
priority. Existing systems have used a pure software
approach to implement an MPEG-2 encoding/decoding system.
The inventor has developed a new 2-D 8x8 DCT/IDCT
algorithm and designed an ASIC to implement this algorithm.

A 2-D DCT/IDCT algorithm according to one embodi- 
ment of the present invention is described in Section 3. 
Starting with a simple matrix notation of the 2-D DCT and 
2-D IDCT, it presents a detailed step-by-step description of 
the new algorithm. The algorithm is based on a direct 2-D
matrix factorization. Compared with traditional approaches, it
has better finite wordlength precision, requires fewer
multiplication operations, and possesses a regular structure
with localized interconnection. Furthermore, it is shown in Section
3 that the algorithm can easily be implemented with only
adders, subtractors, and adder/subtractor combinations.

Finite wordlength simulation of an embodiment of the 
algorithm is described in Section 4. The impacts of both
coefficient quantization and truncation errors are fully investigated.
It is also shown in this section that an optimal implementation
scheme is achieved by combining different finite
wordlengths for coefficient quantization and data truncation.
In order to meet the accuracy requirements of H.261 and
JPEG for both the 2-D DCT and IDCT, only a 16-bit finite
internal wordlength is required by the proposed algorithm.

Section 5 presents the detailed hardware architectural 
structure for the new 2-D DCT/IDCT algorithm according to 
one example implementation of the present invention. It is 
shown that the new algorithm leads to a highly modular,
regular and concurrent architecture using standard components
such as a shuffler, adders, subtractors, accumulators,
latches, and some multiplexers. The combined 2-D
DCT/IDCT architecture demonstrates that all execution 
components are 100% sharable between the 2-D DCT and 
2-D IDCT operations. 

The HDL design and logic synthesis processes for an
embodiment of the algorithm are demonstrated in Section 6.
Using a modern synthesis-oriented ASIC design approach,
the chip implementation is simulated through HDL functionality
coding, RTL code simulation, logic synthesis from
the verified RTL code and gate-level pre-layout simulation
in several stages. The highly automated Computer Aided
Design (CAD) tools used in the simulation process are
Cadence's Verilog-XL® simulation package and Synopsys'
Design Compiler® synthesis package, respectively. The
chip simulation shows that an 800 million samples per second
throughput rate can be achieved for both 2-D DCT and
IDCT computations according to the present invention.

Finally in Section 7, contributions of the present invention 
and applications of this invention are discussed. 

OUTLINE OF DETAILED DESCRIPTION
SECTION 

3.0 2-D DCT/IDCT Algorithm 

3.1 Introduction 

3.2 2-D 8x8 DCT Algorithm 

3.3 2-D 8x8 IDCT Algorithm 

3.4 Further Simplification of the 4x4 Matrix Multiplica- 
tions 

3.5 Summary 

4.0 Finite Wordlength Simulations 

4.1 Introduction 

4.2 Coefficient Quantization Error Effects 

4.2.1 2-D DCT Simulation Results 

4.2.2 2-D IDCT Simulation Results 

4.3 Truncation Error Effects

4.3.1 2-D DCT Simulation Results 

4.3.2 2-D IDCT Simulation Results 

4.4 Combined Quantization and Truncation Error Effects 

4.5 Comparison with Row-Column Method 

4.6 Summary 
5.0 Hardware Architecture Design 

5.1 Introduction 

5.2 Shuffler: Addition/Subtraction Shuffling Device

5.3 Sub-block Operator: 4x4 Matrix Multiplications
Unit

5.4 Auxiliary Components for 2-D DCT or IDCT Imple- 
mentations 

5.5 Architectures for DCT, IDCT and Combined DCT/ 
IDCT 

5.6 Summary 

6.0 HDL Design and Synthesis for an Example 2-D DCT/ 
IDCT Algorithm 

6.1 Introduction 

6.2 HDL Design for Shuffler 

6.2.1 HDL Design for Sub-block Operators 

6.2.2 HDL Design for Auxiliary Components 

6.2.3 HDL Simulation for Combined 2-D DCT/IDCT 

6.3 Logic Synthesis for Example 2-D DCT/IDCT Algorithm

6.4 Summary 
7.0 Conclusions 

7.1 Contributions of this Invention 

7.2 Other Applications 

3.0 2-D DCT/IDCT ALGORITHM 
In this section, a 2-D 8x8 DCT/IDCT algorithm according 
to the present invention is described. Based on a direct 2-D 
approach, not only is this algorithm more computation 
efficient and requires fewer multiplications than traditional 
approaches, but it also results in a simple, regular architec- 
ture with localized communication interconnections. 

3.1 Introduction 

In 2-D DCT or IDCT chip development, almost all the 
2-D DCT or IDCT processors developed so far have made 
use of the separability property of the 2-D DCT or IDCT.
Although the 2-D DCT or IDCT based on the RCM approach
can be realized using a very simple, regular structure with
relatively low design and layout cost, there are several major
drawbacks in almost all 2-D DCT/IDCT implementations
based on the RCM approach [Ura96, SL92, Jan94, MW95,
SL96, etc.]:

(1) Memory components are required to store the inter- 
mediate results between the 1-D row and 1-D column 
transform. And memory cells take a lot of silicon area to 
implement. 

(2) Because it is relatively difficult to design the memory 
block with multiple read/write accesses, serial data in and 
serial data out mode is adopted by most of the RCM 
approaches. Serial data I/O results in relatively low system 
throughput for 2-D DCT or IDCT operation. Generally,
RCM approaches can only achieve a half of the system clock 
rate as system sample processing rate, since the second 1-D 
transform will not start until the first 1-D transform finishes 
and the transposed intermediate data is ready. But exceptions 
have been made to achieve throughputs as high as the system 
clock rate by using two intermediate memory buffers and
transpose circuitry such that the intermediate data are stored
in each of the memory buffers alternately and latency
constraints of the intermediate data can be avoided [UraM]. 
Some have adopted different I/O clock and system clock 
rates to balance the I/O and data processing speed [SL96]. 

(3) Complex transposition hardware is required to trans- 
pose the output of the 1-D row (column) transforms into the 
input format of the 1-D column (row) transforms. The faster 
matrix transposition the system requires, the higher com- 
munication complexity it will involve. 

(4) The latency of the RCM is relatively high because the
1-D row and column transforms must be calculated sequentially.

(5) The separability property of the 2-D DCT or IDCT
used by the RCM limits it to making use of 1-D optimal
solutions, so RCM implementations cannot take full advantage
of the sparseness and factorization of the 2-D transform.
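
The row-column method criticized in the list above can be sketched in a few lines of software. This is only an illustrative sketch, not the circuitry of any cited design: the function names (`dct_matrix`, `dct_1d`, `dct_2d_row_column`, `dct_2d_definition`) are hypothetical, and the scaling sqrt(2/N) with C(0) = 1/sqrt(2) is assumed. It makes the data dependency visible: the second 1-D pass cannot begin until the whole intermediate matrix Y has been produced and transposed.

```python
import math

def dct_matrix(n=8):
    """A[i][j] = sqrt(2/n) * C(i) * cos((2j+1)*i*pi/(2n)), with C(0) = 1/sqrt(2)."""
    a = []
    for i in range(n):
        c = 1 / math.sqrt(2) if i == 0 else 1.0
        a.append([math.sqrt(2 / n) * c * math.cos((2 * j + 1) * i * math.pi / (2 * n))
                  for j in range(n)])
    return a

def dct_1d(vec, a):
    """1-D DCT of one vector: out[k] = sum over m of a[k][m] * vec[m]."""
    return [sum(a[k][m] * vec[m] for m in range(len(vec))) for k in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def dct_2d_row_column(x):
    a = dct_matrix(len(x))
    # Pass 1: 1-D transform of every column of X, i.e. Y = A X.
    y_t = [dct_1d(col, a) for col in transpose(x)]  # rows of y_t are columns of Y
    # Transposition buffer: all of Y must exist before pass 2 can start.
    y = transpose(y_t)
    # Pass 2: 1-D transform of every row of Y, i.e. Z = Y A^T.
    return [dct_1d(row, a) for row in y]

def dct_2d_definition(x):
    """Direct evaluation of the 2-D DCT double sum, used only as a reference."""
    n = len(x)
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    z = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            s = 0.0
            for m in range(n):
                for q in range(n):
                    s += (x[m][q]
                          * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                          * math.cos((2 * q + 1) * l * math.pi / (2 * n)))
            z[k][l] = (2 / n) * c(k) * c(l) * s
    return z
```

Each of the two passes costs N^3 multiplications when evaluated by brute force, which is where the 2N^3 figure used later in this section comes from.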

Although the direct 2-D DCT or IDCT matrix factorization
is more computation efficient and generally requires a
smaller number of multiplications, the major obstacle preventing
this approach from being implemented in VLSI
hardware is the complexity of its global communication
interconnection. The present invention provides an algorithm
which makes full use of the computational efficiency
of a direct 2-D approach and has localized communication
interconnection(s) so as to be suitable for VLSI implementation
and meet the speed requirement of video applications,
including real time applications.

This section describes an algorithm that achieves this 
goal. Direct 2-D DCT and IDCT algorithms are presented 
step-by-step in section 3.2 and 3.3, respectively. In addition, 
the core component of these direct algorithms according to 
one embodiment of the present invention is characterized in 
detail in section 3.4. A summary is provided in section 3.5. 

3.2 2-D 8x8 DCT Algorithm 

Let X(m,n) and Z(k,l) be the NxN input and output
sequences for 0 ≤ m, n ≤ N-1; then the forward 2-D Discrete
Cosine Transform in Eq. (2.32) can be rewritten as:

\[
Z(k,l) = \frac{2}{N}\,C(k)\,C(l)\sum_{m=0}^{N-1}\sum_{n=0}^{N-1} X(m,n)\,
\cos\frac{(2m+1)k\pi}{2N}\,\cos\frac{(2n+1)l\pi}{2N} \tag{3.1}
\]

where

\[
C(k),\ C(l) = \begin{cases} 1/\sqrt{2}, & k,\ l = 0 \\ 1, & \text{otherwise} \end{cases}
\]

For N=8, a coefficient vector W, which is the cosine
function of angles (kπ/2N) for k=1, 2, . . . , N-1, can be
defined as

\[
W = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \\ w_6 \\ w_7 \end{bmatrix}
  = \begin{bmatrix} \cos(\pi/16) \\ \cos(2\pi/16) \\ \cos(3\pi/16) \\ \cos(4\pi/16) \\
    \cos(5\pi/16) \\ \cos(6\pi/16) \\ \cos(7\pi/16) \end{bmatrix} \tag{3.2}
\]

Meanwhile, the 2-D 8x8 DCT in Eq. (3.1) can be
expressed in matrix notation, similar to the 1-D case in Eq. (2.8),
as:

\[
Z = A X A^{T} \tag{3.3}
\]

where A^T is the transpose of matrix A and the elements of A
are

\[
a_{ij} = \frac{1}{2}\,C(i)\cos\frac{(2j+1)\,i\,\pi}{2N}
\]

If each a_ij is replaced with the elements of coefficient
vector W, the matrix A can be expressed as the function of
coefficient w_k for k=1, 2, . . . , N-1 as

\[
A = \frac{1}{2}\begin{bmatrix}
w_4 & w_4 & w_4 & w_4 & w_4 & w_4 & w_4 & w_4 \\
w_1 & w_3 & w_5 & w_7 & -w_7 & -w_5 & -w_3 & -w_1 \\
w_2 & w_6 & -w_6 & -w_2 & -w_2 & -w_6 & w_6 & w_2 \\
w_3 & -w_7 & -w_1 & -w_5 & w_5 & w_1 & w_7 & -w_3 \\
w_4 & -w_4 & -w_4 & w_4 & w_4 & -w_4 & -w_4 & w_4 \\
w_5 & -w_1 & w_7 & w_3 & -w_3 & -w_7 & w_1 & -w_5 \\
w_6 & -w_2 & w_2 & -w_6 & -w_6 & w_2 & -w_2 & w_6 \\
w_7 & -w_5 & w_3 & -w_1 & w_1 & -w_3 & w_5 & -w_7
\end{bmatrix} \tag{3.4}
\]

From Eq. (3.3) one can see that the 2-D DCT can be
decomposed into two stages of 1-D DCT as [Ura92, Jan94,
MW95]:

\[
Y = A X, \qquad Z = Y A^{T} \tag{3.5}
\]

and a total of 2N^3 multiplications are required to compute
matrix Z by the brute force approach. Since the even rows of
matrix A (i.e., A(0), A(2), A(4) and A(6)) are even-symmetric
and the odd rows (i.e., A(1), A(3), A(5) and A(7))
are odd-symmetric, it is possible to facilitate the computation
of the 1-D column transform Y=AX by simply switching
all the even rows of matrices A and Y to the top half and all
the odd rows to the bottom half, which can be carried out by
multiplying by a permutation matrix P1. Two new matrices Y'
and A' can be defined as

\[
Y' = P_1 \cdot Y, \qquad A' = P_1 \cdot A \tag{3.6}
\]

By expressing matrices Y' and A' as the functions of the
row vectors of the matrices Y and A, Eq. (3.6) can be further
extended as

\[
Y' = \begin{bmatrix} Y(0)\\ Y(2)\\ Y(4)\\ Y(6)\\ Y(1)\\ Y(3)\\ Y(5)\\ Y(7) \end{bmatrix}
   = \begin{bmatrix} P_2 \\ P_3 \end{bmatrix} Y = P_1 Y, \qquad
A' = \begin{bmatrix} A(0)\\ A(2)\\ A(4)\\ A(6)\\ A(1)\\ A(3)\\ A(5)\\ A(7) \end{bmatrix}
   = \begin{bmatrix} P_2 \\ P_3 \end{bmatrix} A = P_1 A \tag{3.7}
\]

where Y(k) and A(k) are the row vectors of matrices Y and
A, respectively. Now, the rows at the top half of matrix
A'=P1·A are even-symmetric and the rows at the bottom half
are odd-symmetric. Thus, the matrix multiplication of
Y'=A'X can be calculated through two 4x8 matrices as

\[
\begin{bmatrix} Y(0)\\ Y(2)\\ Y(4)\\ Y(6) \end{bmatrix} = \frac{1}{2}
\begin{bmatrix}
w_4 & w_4 & w_4 & w_4 \\
w_2 & w_6 & -w_6 & -w_2 \\
w_4 & -w_4 & -w_4 & w_4 \\
w_6 & -w_2 & w_2 & -w_6
\end{bmatrix}
\begin{bmatrix} X(0)+X(7)\\ X(1)+X(6)\\ X(2)+X(5)\\ X(3)+X(4) \end{bmatrix} \tag{3.8}
\]

\[
\begin{bmatrix} Y(1)\\ Y(3)\\ Y(5)\\ Y(7) \end{bmatrix} = \frac{1}{2}
\begin{bmatrix}
w_1 & w_3 & w_5 & w_7 \\
w_3 & -w_7 & -w_1 & -w_5 \\
w_5 & -w_1 & w_7 & w_3 \\
w_7 & -w_5 & w_3 & -w_1
\end{bmatrix}
\begin{bmatrix} X(0)-X(7)\\ X(1)-X(6)\\ X(2)-X(5)\\ X(3)-X(4) \end{bmatrix} \tag{3.9}
\]

where X(k) is the row vector of matrix X.

It can be seen that the 4x4 coefficient matrix in Eq. (3.8)
only includes the even coefficients of vector W (i.e. w2, w4,
w6) and the one in Eq. (3.9) only includes the odd coefficients
(i.e. w1, w3, w5, w7). In fact, they can be defined as two new 4x4
coefficient matrices E and O [Jan94, MW95]. By means of
matrix notation, E and O can also be computed directly from
the coefficient matrix A by the following matrix operations
as:

\[
E = P_2 A P_4, \qquad O = P_3 A P_4 \tag{3.10}
\]

where P2 and P3 are defined in Eq. (3.7) as the top and
bottom blocks of matrix P1, and P4 is a new permutation
matrix that takes the first four columns of a 4x8 matrix to
form a 4x4 one. Mathematically, the matrix P4 can be
defined as

\[
P_4 = \begin{bmatrix} I_4 \\ N_4 \end{bmatrix}_{8\times 4} \tag{3.11}
\]

where the matrix I4 is a 4x4 identity matrix, and the matrix
N4 is a 4x4 null (zero) matrix.

In addition, the matrices [X(i)+X(j)] and [X(i)-X(j)] in
Eq. (3.8) and Eq. (3.9) can also be defined as two separate
4x8 matrices X+ and X-, each split into its left and right
4x4 blocks, as:

\[
X_{+} = \begin{bmatrix} X_{+l} & X_{+r} \end{bmatrix}_{4\times 8}
      = \begin{bmatrix} X(0)+X(7)\\ X(1)+X(6)\\ X(2)+X(5)\\ X(3)+X(4) \end{bmatrix}, \qquad
X_{-} = \begin{bmatrix} X_{-l} & X_{-r} \end{bmatrix}_{4\times 8}
      = \begin{bmatrix} X(0)-X(7)\\ X(1)-X(6)\\ X(2)-X(5)\\ X(3)-X(4) \end{bmatrix} \tag{3.12}
\]

After substituting the X+, X-, X+l, X+r, X-l and X-r in Eq.
(3.12) and the E, O in Eq. (3.10) into Eqs. (3.8) and (3.9), the 1-D
column transform Y=AX can be calculated by first calculating
its permutation Y'. The substitution can be carried out
as follows:

\[
Y' = \begin{bmatrix} E X_{+} \\ O X_{-} \end{bmatrix}_{8\times 8}
   = \begin{bmatrix} E X_{+l} & E X_{+r} \\ O X_{-l} & O X_{-r} \end{bmatrix}
\quad\text{and}\quad Y = (P_1)^{-1}\, Y' \tag{3.13}
\]

Furthermore, if the input matrix X is decomposed into
four 4x4 sub-matrices as

\[
X = \begin{bmatrix} X1_{4\times 4} & X2_{4\times 4} \\ X3_{4\times 4} & X4_{4\times 4} \end{bmatrix} \tag{3.14}
\]

Using Eq. (3.12), the matrices X+l, X+r, X-l and X-r
can also be expressed as the functions of matrices X1, X2,
X3 and X4 as






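\[
X_{+l} = X1 + \tilde{I}_4 X3, \quad X_{+r} = X2 + \tilde{I}_4 X4, \quad
X_{-l} = X1 - \tilde{I}_4 X3, \quad X_{-r} = X2 - \tilde{I}_4 X4 \tag{3.15}
\]

where $\tilde{I}_4$ is defined as a 4x4 opposite identity matrix as

\[
\tilde{I}_4 = \begin{bmatrix} 0&0&0&1 \\ 0&0&1&0 \\ 0&1&0&0 \\ 1&0&0&0 \end{bmatrix} \tag{3.16}
\]

Since the X+l, X+r, X-l and X-r have been expressed as the
functions of matrices X1, X2, X3 and X4 in Eq. (3.15), by
substituting them into Eq. (3.13), the first stage of 1-D
column transform Y=AX can be expressed directly as the
function of input matrix X as

\[
Y' = \begin{bmatrix}
E(X1 + \tilde{I}_4 X3) & E(X2 + \tilde{I}_4 X4) \\
O(X1 - \tilde{I}_4 X3) & O(X2 - \tilde{I}_4 X4)
\end{bmatrix}, \qquad
Y = (P_1)^{-1} Y' = (P_1)^{-1}\begin{bmatrix}
E(X1 + \tilde{I}_4 X3) & E(X2 + \tilde{I}_4 X4) \\
O(X1 - \tilde{I}_4 X3) & O(X2 - \tilde{I}_4 X4)
\end{bmatrix} \tag{3.17}
\]

where matrices P1, E and O are defined in Eq. (3.7) and
(3.10), respectively.

Similar mathematical manipulations can be applied to the
second stage of 1-D row transform Z=YA^T, too. By switching
the row vectors of matrix Z, a new matrix Z' can be
formed as the function of the row vectors of matrix Z as

\[
Z' = \begin{bmatrix} Z(0)\\ Z(2)\\ Z(4)\\ Z(6)\\ Z(1)\\ Z(3)\\ Z(5)\\ Z(7) \end{bmatrix}
   = P_1 Z = P_1 Y A^{T}
   = \begin{bmatrix} Y(0)\\ Y(2)\\ Y(4)\\ Y(6)\\ Y(1)\\ Y(3)\\ Y(5)\\ Y(7) \end{bmatrix} A^{T}
   = Y' A^{T} \tag{3.18}
\]

Taking the transpose of both sides of Eq. (3.18), the
transposition of matrix Z' can be expressed as

\[
(Z')^{T} = (Y' A^{T})^{T} = A\,(Y')^{T} \tag{3.19}
\]

The matrix multiplication (Z')^T=A(Y')^T can be calculated
in the same fashion as calculating Y=AX since they are the
same matrix multiplication in essence (i.e. matrix A multiplied
by another matrix).

In order to be able to use Eq. (3.17) to compute (Z')^T=A(Y')^T,
(Y')^T should be decomposed into four 4x4 sub-matrices
as done to the matrix X. By taking the transpose of both
sides of the first equation in Eq. (3.17), one can decompose
the (Y')^T as

\[
(Y')^{T} = \begin{bmatrix}
E(X1 + \tilde{I}_4 X3) & E(X2 + \tilde{I}_4 X4) \\
O(X1 - \tilde{I}_4 X3) & O(X2 - \tilde{I}_4 X4)
\end{bmatrix}^{T}_{8\times 8}
= \begin{bmatrix}
(X1 + \tilde{I}_4 X3)^{T} E^{T} & (X1 - \tilde{I}_4 X3)^{T} O^{T} \\
(X2 + \tilde{I}_4 X4)^{T} E^{T} & (X2 - \tilde{I}_4 X4)^{T} O^{T}
\end{bmatrix} \tag{3.20}
\]

By replacing the X1, X2, X3 and X4 with (X1+Ĩ4X3)^T E^T,
(X1-Ĩ4X3)^T O^T, (X2+Ĩ4X4)^T E^T and (X2-Ĩ4X4)^T O^T in the
second equation in Eq. (3.17), the matrix multiplication
(Z')^T=A(Y')^T can be computed as

\[
\begin{aligned}
(Z')^{T} &= (P_1)^{-1}\begin{bmatrix}
E\big((X1 + \tilde{I}_4 X3)^{T}E^{T} + \tilde{I}_4 (X2 + \tilde{I}_4 X4)^{T}E^{T}\big) &
E\big((X1 - \tilde{I}_4 X3)^{T}O^{T} + \tilde{I}_4 (X2 - \tilde{I}_4 X4)^{T}O^{T}\big) \\
O\big((X1 + \tilde{I}_4 X3)^{T}E^{T} - \tilde{I}_4 (X2 + \tilde{I}_4 X4)^{T}E^{T}\big) &
O\big((X1 - \tilde{I}_4 X3)^{T}O^{T} - \tilde{I}_4 (X2 - \tilde{I}_4 X4)^{T}O^{T}\big)
\end{bmatrix} \\
&= (P_1)^{-1}\begin{bmatrix}
E\big((X1 + \tilde{I}_4 X3)^{T} + \tilde{I}_4 (X2 + \tilde{I}_4 X4)^{T}\big)E^{T} &
E\big((X1 - \tilde{I}_4 X3)^{T} + \tilde{I}_4 (X2 - \tilde{I}_4 X4)^{T}\big)O^{T} \\
O\big((X1 + \tilde{I}_4 X3)^{T} - \tilde{I}_4 (X2 + \tilde{I}_4 X4)^{T}\big)E^{T} &
O\big((X1 - \tilde{I}_4 X3)^{T} - \tilde{I}_4 (X2 - \tilde{I}_4 X4)^{T}\big)O^{T}
\end{bmatrix}
\end{aligned} \tag{3.21}
\]

Consequently, one can compute the 2-D DCT result Z by
first solving the matrix (Z')^T through Eq. (3.21). Let's define
four new 4x4 matrices X++, X-+, X+- and X-- directly from
input matrix X as

\[
\begin{aligned}
X_{++} &= (X1 + \tilde{I}_4 X3)^{T} + \tilde{I}_4 (X2 + \tilde{I}_4 X4)^{T}, \\
X_{-+} &= (X1 - \tilde{I}_4 X3)^{T} + \tilde{I}_4 (X2 - \tilde{I}_4 X4)^{T}, \\
X_{+-} &= (X1 + \tilde{I}_4 X3)^{T} - \tilde{I}_4 (X2 + \tilde{I}_4 X4)^{T}, \\
X_{--} &= (X1 - \tilde{I}_4 X3)^{T} - \tilde{I}_4 (X2 - \tilde{I}_4 X4)^{T},
\end{aligned} \tag{3.22}
\]

such that the Eq. (3.21) can be rewritten as

\[
(P_1)(Z')^{T} = \begin{bmatrix}
E X_{++} E^{T} & E X_{-+} O^{T} \\
O X_{+-} E^{T} & O X_{--} O^{T}
\end{bmatrix} \tag{3.23}
\]

By decomposing (P1)(Z')^T into four 4x4 matrices as its
top-left, top-right, bottom-left and bottom-right blocks as

\[
(P_1)(Z')^{T} = (P_1)(P_1 \cdot Z)^{T}
= \begin{bmatrix} Z1_{4\times 4} & Z2_{4\times 4} \\ Z3_{4\times 4} & Z4_{4\times 4} \end{bmatrix} \tag{3.24}
\]

one can finally compute the elements of the 2-D DCT
Z=AXA^T through Eq. (3.23) as

\[
\begin{aligned}
Z1 &= \begin{bmatrix}
Z_{00} & Z_{20} & Z_{40} & Z_{60} \\
Z_{02} & Z_{22} & Z_{42} & Z_{62} \\
Z_{04} & Z_{24} & Z_{44} & Z_{64} \\
Z_{06} & Z_{26} & Z_{46} & Z_{66}
\end{bmatrix} = E X_{++} E^{T}, &
Z2 &= \begin{bmatrix}
Z_{10} & Z_{30} & Z_{50} & Z_{70} \\
Z_{12} & Z_{32} & Z_{52} & Z_{72} \\
Z_{14} & Z_{34} & Z_{54} & Z_{74} \\
Z_{16} & Z_{36} & Z_{56} & Z_{76}
\end{bmatrix} = E X_{-+} O^{T}, \\
Z3 &= \begin{bmatrix}
Z_{01} & Z_{21} & Z_{41} & Z_{61} \\
Z_{03} & Z_{23} & Z_{43} & Z_{63} \\
Z_{05} & Z_{25} & Z_{45} & Z_{65} \\
Z_{07} & Z_{27} & Z_{47} & Z_{67}
\end{bmatrix} = O X_{+-} E^{T}, &
Z4 &= \begin{bmatrix}
Z_{11} & Z_{31} & Z_{51} & Z_{71} \\
Z_{13} & Z_{33} & Z_{53} & Z_{73} \\
Z_{15} & Z_{35} & Z_{55} & Z_{75} \\
Z_{17} & Z_{37} & Z_{57} & Z_{77}
\end{bmatrix} = O X_{--} O^{T}
\end{aligned} \tag{3.25}
\]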
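
The four sub-block products of Eq. (3.25) can be checked numerically against the plain matrix form Z = AXA^T. The following is a minimal software sketch, not the hardware data flow: all function names are hypothetical, floating point is used instead of finite wordlengths, and `flip` stands for left-multiplication by the opposite identity matrix.

```python
import math

def dct_matrix(n=8):
    """A[i][j] = sqrt(2/n) * C(i) * cos((2j+1)*i*pi/(2n)), with C(0) = 1/sqrt(2)."""
    return [[math.sqrt(2 / n) * (1 / math.sqrt(2) if i == 0 else 1.0)
             * math.cos((2 * j + 1) * i * math.pi / (2 * n)) for j in range(n)]
            for i in range(n)]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def add(p, q):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(p, q)]

def sub(p, q):
    return [[a - b for a, b in zip(r, s)] for r, s in zip(p, q)]

def flip(m):
    """Reverse the row order, i.e. multiply by the opposite identity from the left."""
    return [list(row) for row in reversed(m)]

def block(m, r0, c0):
    return [row[c0:c0 + 4] for row in m[r0:r0 + 4]]

def dct_2d_sub_blocks(x):
    a = dct_matrix(8)
    e = [a[2 * i][:4] for i in range(4)]       # E = P2 A P4
    o = [a[2 * i + 1][:4] for i in range(4)]   # O = P3 A P4
    x1, x2, x3, x4 = block(x, 0, 0), block(x, 0, 4), block(x, 4, 0), block(x, 4, 4)
    p_l = transpose(add(x1, flip(x3)))         # (X1 + I~ X3)^T
    m_l = transpose(sub(x1, flip(x3)))         # (X1 - I~ X3)^T
    p_r = flip(transpose(add(x2, flip(x4))))   # I~ (X2 + I~ X4)^T
    m_r = flip(transpose(sub(x2, flip(x4))))   # I~ (X2 - I~ X4)^T
    x_pp, x_mp = add(p_l, p_r), add(m_l, m_r)  # X++ and X-+ of Eq. (3.22)
    x_pm, x_mm = sub(p_l, p_r), sub(m_l, m_r)  # X+- and X-- of Eq. (3.22)
    et, ot = transpose(e), transpose(o)
    z1 = matmul(matmul(e, x_pp), et)
    z2 = matmul(matmul(e, x_mp), ot)
    z3 = matmul(matmul(o, x_pm), et)
    z4 = matmul(matmul(o, x_mm), ot)
    z = [[0.0] * 8 for _ in range(8)]
    for i in range(4):
        for j in range(4):
            # Element placement follows the layouts listed in Eq. (3.25).
            z[2 * j][2 * i] = z1[i][j]
            z[2 * j + 1][2 * i] = z2[i][j]
            z[2 * j][2 * i + 1] = z3[i][j]
            z[2 * j + 1][2 * i + 1] = z4[i][j]
    return z
```

The four products are mutually independent, so in this sketch they could be evaluated in any order or concurrently; only the additions and subtractions that form X++, X-+, X+- and X-- touch all four input quadrants.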



Without the present invention, one would need to compute
the matrix product of three 8x8 matrices in Z=AXA^T.
By using Eq. (3.25), the result matrix Z is decomposed into
four 4x4 matrices, and each of them can be calculated as the
matrix product of three 4x4 matrices, the "half-size" operations
compared with the original 8x8 one. The total number of
multiplications is reduced from 2N^3 to N^3 since each of the
4x4 matrix products requires 2(N/2)^3 multiplications when
computed by brute force.

3.3 2-D 8x8 IDCT Algorithm

Similar to the 2-D DCT, one can also decompose the 2-D
IDCT into a much simpler form, and thus reduce the total
amount of computation of the 2-D IDCT.

Let's define Z(k,l) and X(m,n) as 2-D IDCT input and
output matrices, respectively. The 2-D IDCT definition in
Eq. (2.33) can then be rewritten as:

\[
X(m,n) = \frac{2}{N}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1} C(k)\,C(l)\,Z(k,l)\,
\cos\frac{(2m+1)k\pi}{2N}\,\cos\frac{(2n+1)l\pi}{2N} \tag{3.26}
\]

where

\[
C(k),\ C(l) = \begin{cases} 1/\sqrt{2}, & k,\ l = 0 \\ 1, & \text{otherwise} \end{cases}
\]

For N=8, the definition of the coefficient vector W in Eq.
(3.2) and the coefficient matrix A in Eq. (3.4) can also be
used in the computation of the 2-D IDCT. And Eq. (3.26) can
be expressed with the 2-D IDCT matrix notation as

\[
X = A^{T} Z A \tag{3.27}
\]

In order to have consistent input and output matrix
notations with the 2-D DCT, X represents a 2-D IDCT input
matrix and Z represents a 2-D IDCT output matrix in the rest
of this section. In this way, the 2-D IDCT matrix expression
in Eq. (3.27) will be rewritten again as

\[
Z = A^{T} X A \tag{3.28}
\]

The 2-D IDCT in Eq. (3.28) can also be decomposed into
two stages of 1-D IDCT as [Ura92, MW95]:

\[
Y = A^{T} X, \qquad Z = Y A = (A^{T} Y^{T})^{T} \tag{3.29}
\]

Note that the notation Y is reused to express the result of the 1-D
IDCT column transform, which is different from the Y in Eq.
(3.5) as the result of the 1-D DCT column transform.

Because of the symmetric characteristics of the coefficient
matrix A, the first stage of 1-D IDCT Y=A^TX can be
computed through a permuted matrix

\[
Y'' = \begin{bmatrix} Y(0)\\ Y(1)\\ Y(2)\\ Y(3)\\ Y(7)\\ Y(6)\\ Y(5)\\ Y(4) \end{bmatrix}
    = P_5 Y = (P_5 A^{T}) X \tag{3.30}
\]

to reverse the orders of the rows in the bottom half of
matrices Y and A^T, where the permutation matrix P5 is
defined as

\[
P_5 = \begin{bmatrix} I_4 & N_4 \\ N_4 & \tilde{I}_4 \end{bmatrix} \tag{3.31}
\]

Thus, the matrix multiplication of Y''=(P5·A^T)X can be
calculated through two 4x8 matrices as the functions of the
row vectors of matrix X as [Ura92, MW95]:

\[
\begin{bmatrix} Y(0)\\ Y(1)\\ Y(2)\\ Y(3) \end{bmatrix}
 = E^{T}\begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6) \end{bmatrix}
 + O^{T}\begin{bmatrix} X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix}, \qquad
\begin{bmatrix} Y(7)\\ Y(6)\\ Y(5)\\ Y(4) \end{bmatrix}
 = E^{T}\begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6) \end{bmatrix}
 - O^{T}\begin{bmatrix} X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix} \tag{3.32}
\]

where the matrices E and O are the same ones defined in Eq.
(3.10).

By defining two 4x8 matrices X_e and X_o as the even rows
(i.e. X(0), X(2), X(4) and X(6)) and the odd rows (i.e. X(1),
X(3), X(5) and X(7)) of matrix X as:

\[
X_e = \begin{bmatrix} X(0)\\ X(2)\\ X(4)\\ X(6) \end{bmatrix} = P_2 X, \qquad
X_o = \begin{bmatrix} X(1)\\ X(3)\\ X(5)\\ X(7) \end{bmatrix} = P_3 X \tag{3.33}
\]

where matrices P2 and P3 are defined in Eq. (3.7), the 1-D
IDCT column transform Y=A^TX can be computed through
Eq. (3.30), (3.32) and (3.33) as

\[
Y'' = \begin{bmatrix} E^{T} X_e + O^{T} X_o \\ E^{T} X_e - O^{T} X_o \end{bmatrix}
\quad\text{and}\quad
Y = (P_5)^{-1} Y'' = (P_5)^{-1}\begin{bmatrix} E^{T} X_e + O^{T} X_o \\ E^{T} X_e - O^{T} X_o \end{bmatrix} \tag{3.34}
\]

Defining another permutation matrix P6 as

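\[
P_6 = \begin{bmatrix} N_4 \\ I_4 \end{bmatrix}_{8\times 4} \tag{3.35}
\]

then the matrices X_e and X_o can be separated into their left
and right blocks as

\[
\begin{aligned}
X_{el} &= X_e \cdot P_4 = P_2 \cdot X \cdot P_4, & X_{er} &= X_e \cdot P_6 = P_2 \cdot X \cdot P_6, \\
X_{ol} &= X_o \cdot P_4 = P_3 \cdot X \cdot P_4, & X_{or} &= X_o \cdot P_6 = P_3 \cdot X \cdot P_6,
\end{aligned} \tag{3.36}
\]

such that the Eq. (3.34) can be rewritten as

\[
Y'' = \begin{bmatrix}
E^{T} X_{el} + O^{T} X_{ol} & E^{T} X_{er} + O^{T} X_{or} \\
E^{T} X_{el} - O^{T} X_{ol} & E^{T} X_{er} - O^{T} X_{or}
\end{bmatrix}
\quad\text{and}\quad
Y = (P_5)^{-1} Y'' \tag{3.37}
\]

And the 8x8 matrix multiplication Y''=(P5·A^T)X has been
simplified as several "half-size" matrix multiplications.

The second stage of 1-D IDCT can be evaluated in a
similar way. After multiplying by the permutation matrix P5 on
both sides of Z=YA, a new permuted matrix Z'' can be
defined as

\[
Z'' = P_5 Z = (P_5 Y) A = Y'' A \tag{3.38}
\]

Thus, the transposition of matrix Z'' can be expressed as

\[
(Z'')^{T} = A^{T} (Y'')^{T} \tag{3.39}
\]

which happens to have the same format as the first stage of
1-D IDCT Y=A^TX does. And the matrix (Y'')^T can be
computed directly by transposing Y'' in Eq. (3.37) as

\[
(Y'')^{T} = \begin{bmatrix}
E^{T} X_{el} + O^{T} X_{ol} & E^{T} X_{er} + O^{T} X_{or} \\
E^{T} X_{el} - O^{T} X_{ol} & E^{T} X_{er} - O^{T} X_{or}
\end{bmatrix}^{T}
= \begin{bmatrix}
X_{el}^{T} E + X_{ol}^{T} O & X_{el}^{T} E - X_{ol}^{T} O \\
X_{er}^{T} E + X_{or}^{T} O & X_{er}^{T} E - X_{or}^{T} O
\end{bmatrix} \tag{3.40}
\]

Furthermore, by separating (Y'')^T into even-left, even-right,
odd-left and odd-right 4x4 sub-matrices as done
for matrix X in Eq. (3.36), the result of Eq. (3.37) can be
directly used to express the matrix (Z'')^T as

\[
(Z'')^{T} = (P_5)^{-1}\begin{bmatrix}
E^{T}\big((Y'')^{T}\big)_{el} + O^{T}\big((Y'')^{T}\big)_{ol} &
E^{T}\big((Y'')^{T}\big)_{er} + O^{T}\big((Y'')^{T}\big)_{or} \\
E^{T}\big((Y'')^{T}\big)_{el} - O^{T}\big((Y'')^{T}\big)_{ol} &
E^{T}\big((Y'')^{T}\big)_{er} - O^{T}\big((Y'')^{T}\big)_{or}
\end{bmatrix} \tag{3.41}
\]

In fact, the even-left, even-right, odd-left and odd-right
4x4 sub-matrices of (Y'')^T can be computed by replacing
the matrix X with the (Y'')^T in Eq. (3.36) as

\[
\begin{aligned}
\big((Y'')^{T}\big)_{el} &= P_2 \cdot (Y'')^{T} \cdot P_4
 = P_2\begin{bmatrix} X_{el}^{T} E + X_{ol}^{T} O \\ X_{er}^{T} E + X_{or}^{T} O \end{bmatrix}, &
\big((Y'')^{T}\big)_{er} &= P_2 \cdot (Y'')^{T} \cdot P_6
 = P_2\begin{bmatrix} X_{el}^{T} E - X_{ol}^{T} O \\ X_{er}^{T} E - X_{or}^{T} O \end{bmatrix}, \\
\big((Y'')^{T}\big)_{ol} &= P_3 \cdot (Y'')^{T} \cdot P_4
 = P_3\begin{bmatrix} X_{el}^{T} E + X_{ol}^{T} O \\ X_{er}^{T} E + X_{or}^{T} O \end{bmatrix}, &
\big((Y'')^{T}\big)_{or} &= P_3 \cdot (Y'')^{T} \cdot P_6
 = P_3\begin{bmatrix} X_{el}^{T} E - X_{ol}^{T} O \\ X_{er}^{T} E - X_{or}^{T} O \end{bmatrix}
\end{aligned} \tag{3.42}
\]

In order to resolve the Eq. (3.42) as a series of 4x4 matrix
operations, four new 4x4 matrices P7, P8, P9 and P10 can
be defined as the left and right 4x4 blocks of P2 and P3:

\[
P_2 = \begin{bmatrix} P_7 & P_8 \end{bmatrix}, \qquad
P_3 = \begin{bmatrix} P_9 & P_{10} \end{bmatrix} \tag{3.43}
\]

By replacing P2 and P3 with P7, P8, P9 and P10, the Eq.
(3.42) can be further expressed as

\[
\begin{aligned}
\big((Y'')^{T}\big)_{el} &= P_7 (X_{el}^{T} E + X_{ol}^{T} O) + P_8 (X_{er}^{T} E + X_{or}^{T} O)
 = (P_7 X_{el}^{T} + P_8 X_{er}^{T}) E + (P_7 X_{ol}^{T} + P_8 X_{or}^{T}) O, \\
\big((Y'')^{T}\big)_{er} &= P_7 (X_{el}^{T} E - X_{ol}^{T} O) + P_8 (X_{er}^{T} E - X_{or}^{T} O)
 = (P_7 X_{el}^{T} + P_8 X_{er}^{T}) E - (P_7 X_{ol}^{T} + P_8 X_{or}^{T}) O, \\
\big((Y'')^{T}\big)_{ol} &= P_9 (X_{el}^{T} E + X_{ol}^{T} O) + P_{10} (X_{er}^{T} E + X_{or}^{T} O)
 = (P_9 X_{el}^{T} + P_{10} X_{er}^{T}) E + (P_9 X_{ol}^{T} + P_{10} X_{or}^{T}) O, \\
\big((Y'')^{T}\big)_{or} &= P_9 (X_{el}^{T} E - X_{ol}^{T} O) + P_{10} (X_{er}^{T} E - X_{or}^{T} O)
 = (P_9 X_{el}^{T} + P_{10} X_{er}^{T}) E - (P_9 X_{ol}^{T} + P_{10} X_{or}^{T}) O
\end{aligned} \tag{3.44}
\]

By defining four 4x4 matrices in the following way as

\[
\begin{aligned}
X_{ee} &= P_7 X_{el}^{T} + P_8 X_{er}^{T} = P_7 (P_2 X P_4)^{T} + P_8 (P_2 X P_6)^{T} = P_2 (P_2 X)^{T}, \\
X_{oe} &= P_7 X_{ol}^{T} + P_8 X_{or}^{T} = P_7 (P_3 X P_4)^{T} + P_8 (P_3 X P_6)^{T} = P_2 (P_3 X)^{T}, \\
X_{eo} &= P_9 X_{el}^{T} + P_{10} X_{er}^{T} = P_9 (P_2 X P_4)^{T} + P_{10} (P_2 X P_6)^{T} = P_3 (P_2 X)^{T}, \\
X_{oo} &= P_9 X_{ol}^{T} + P_{10} X_{or}^{T} = P_9 (P_3 X P_4)^{T} + P_{10} (P_3 X P_6)^{T} = P_3 (P_3 X)^{T}
\end{aligned} \tag{3.45}
\]

Eq. (3.44) can be rewritten as

\[
\begin{aligned}
\big((Y'')^{T}\big)_{el} &= X_{ee} E + X_{oe} O, &
\big((Y'')^{T}\big)_{er} &= X_{ee} E - X_{oe} O, \\
\big((Y'')^{T}\big)_{ol} &= X_{eo} E + X_{oo} O, &
\big((Y'')^{T}\big)_{or} &= X_{eo} E - X_{oo} O
\end{aligned} \tag{3.46}
\]

Further substituting Eq. (3.46) into Eq. (3.41) yields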



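\[
\begin{aligned}
(P_5)(Z'')^{T} &= \begin{bmatrix}
E^{T}(X_{ee}E + X_{oe}O) + O^{T}(X_{eo}E + X_{oo}O) &
E^{T}(X_{ee}E - X_{oe}O) + O^{T}(X_{eo}E - X_{oo}O) \\
E^{T}(X_{ee}E + X_{oe}O) - O^{T}(X_{eo}E + X_{oo}O) &
E^{T}(X_{ee}E - X_{oe}O) - O^{T}(X_{eo}E - X_{oo}O)
\end{bmatrix} \\
&= \begin{bmatrix}
E^{T}X_{ee}E + E^{T}X_{oe}O + O^{T}X_{eo}E + O^{T}X_{oo}O &
E^{T}X_{ee}E - E^{T}X_{oe}O + O^{T}X_{eo}E - O^{T}X_{oo}O \\
E^{T}X_{ee}E + E^{T}X_{oe}O - O^{T}X_{eo}E - O^{T}X_{oo}O &
E^{T}X_{ee}E - E^{T}X_{oe}O - O^{T}X_{eo}E + O^{T}X_{oo}O
\end{bmatrix}
\end{aligned} \tag{3.47}
\]

By decomposing (P5)(Z'')^T into four 4x4 matrices as its
top-left, top-right, bottom-left and bottom-right blocks as

\[
(P_5)(Z'')^{T} = (P_5)(P_5 \cdot Z)^{T}
= \begin{bmatrix} Z5_{4\times 4} & Z6_{4\times 4} \\ Z7_{4\times 4} & Z8_{4\times 4} \end{bmatrix} \tag{3.48}
\]

one can finally compute each element of the 2-D IDCT
Z=A^TXA through Eq. (3.47) as

\[
\begin{aligned}
Z5 &= \begin{bmatrix}
Z_{00} & Z_{10} & Z_{20} & Z_{30} \\
Z_{01} & Z_{11} & Z_{21} & Z_{31} \\
Z_{02} & Z_{12} & Z_{22} & Z_{32} \\
Z_{03} & Z_{13} & Z_{23} & Z_{33}
\end{bmatrix} = E^{T}X_{ee}E + E^{T}X_{oe}O + O^{T}X_{eo}E + O^{T}X_{oo}O, \\
Z6 &= \begin{bmatrix}
Z_{70} & Z_{60} & Z_{50} & Z_{40} \\
Z_{71} & Z_{61} & Z_{51} & Z_{41} \\
Z_{72} & Z_{62} & Z_{52} & Z_{42} \\
Z_{73} & Z_{63} & Z_{53} & Z_{43}
\end{bmatrix} = E^{T}X_{ee}E - E^{T}X_{oe}O + O^{T}X_{eo}E - O^{T}X_{oo}O, \\
Z7 &= \begin{bmatrix}
Z_{07} & Z_{17} & Z_{27} & Z_{37} \\
Z_{06} & Z_{16} & Z_{26} & Z_{36} \\
Z_{05} & Z_{15} & Z_{25} & Z_{35} \\
Z_{04} & Z_{14} & Z_{24} & Z_{34}
\end{bmatrix} = E^{T}X_{ee}E + E^{T}X_{oe}O - O^{T}X_{eo}E - O^{T}X_{oo}O, \\
Z8 &= \begin{bmatrix}
Z_{77} & Z_{67} & Z_{57} & Z_{47} \\
Z_{76} & Z_{66} & Z_{56} & Z_{46} \\
Z_{75} & Z_{65} & Z_{55} & Z_{45} \\
Z_{74} & Z_{64} & Z_{54} & Z_{44}
\end{bmatrix} = E^{T}X_{ee}E - E^{T}X_{oe}O - O^{T}X_{eo}E + O^{T}X_{oo}O
\end{aligned} \tag{3.49}
\]

Similar to the 2-D DCT, the 2-D IDCT Z=A^TXA is also
simplified from three 8x8 matrix multiplications to four 4x4
matrix multiplications E^T X_ee E, E^T X_oe O, O^T X_eo E and
O^T X_oo O, where each of them can be calculated as the matrix
product of three 4x4 matrices, the "half-size" operations
compared with the original 8x8 one. Since the matrices P7, P8,
P9 and P10 used in Eq. (3.45) are all pure permutation
matrices, no extra multiplication has been introduced for
computing X_ee, X_oe, X_eo and X_oo.

Generally speaking, each element of matrix Z can be
calculated with the same general formula as

\[
z_{ij} = \sum_{k=0}^{3}\sum_{l=0}^{3}\Big\{ x^{ee}_{kl}\,(e_{ki}\,e_{lj}) \pm x^{oe}_{kl}\,(e_{ki}\,o_{lj})
 \pm x^{eo}_{kl}\,(o_{ki}\,e_{lj}) \pm x^{oo}_{kl}\,(o_{ki}\,o_{lj}) \Big\} \tag{3.50}
\]

where

\[
X_{ee} = [x^{ee}_{kl}], \quad X_{oe} = [x^{oe}_{kl}], \quad X_{eo} = [x^{eo}_{kl}], \quad X_{oo} = [x^{oo}_{kl}],
\]

e_kl and o_kl denote the elements of E and O, and i, j index the
element within the corresponding 4x4 block Z5, Z6, Z7 or Z8.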
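
The IDCT decomposition of Eq. (3.45) through (3.49) can be checked in the same way. The sketch below is illustrative only: the function names are hypothetical, floating point replaces finite wordlengths, and the four inputs X_ee, X_oe, X_eo and X_oo are formed by index selection rather than by explicit permutation matrices.

```python
import math

def dct_matrix(n=8):
    """A[i][j] = sqrt(2/n) * C(i) * cos((2j+1)*i*pi/(2n)), with C(0) = 1/sqrt(2)."""
    return [[math.sqrt(2 / n) * (1 / math.sqrt(2) if i == 0 else 1.0)
             * math.cos((2 * j + 1) * i * math.pi / (2 * n)) for j in range(n)]
            for i in range(n)]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def idct_2d_sub_blocks(x):
    a = dct_matrix(8)
    e = [a[2 * i][:4] for i in range(4)]       # E = P2 A P4
    o = [a[2 * i + 1][:4] for i in range(4)]   # O = P3 A P4
    et, ot = transpose(e), transpose(o)
    # Eq. (3.45): pure data reordering, no arithmetic.
    x_ee = [[x[2 * j][2 * i] for j in range(4)] for i in range(4)]          # P2 (P2 X)^T
    x_oe = [[x[2 * j + 1][2 * i] for j in range(4)] for i in range(4)]      # P2 (P3 X)^T
    x_eo = [[x[2 * j][2 * i + 1] for j in range(4)] for i in range(4)]      # P3 (P2 X)^T
    x_oo = [[x[2 * j + 1][2 * i + 1] for j in range(4)] for i in range(4)]  # P3 (P3 X)^T
    # The four independent 4x4 triple products.
    t_ee = matmul(matmul(et, x_ee), e)
    t_oe = matmul(matmul(et, x_oe), o)
    t_eo = matmul(matmul(ot, x_eo), e)
    t_oo = matmul(matmul(ot, x_oo), o)
    z = [[0.0] * 8 for _ in range(8)]
    for i in range(4):
        for j in range(4):
            ee, oe, eo, oo = t_ee[i][j], t_oe[i][j], t_eo[i][j], t_oo[i][j]
            # Sign patterns and element placement follow Eq. (3.49).
            z[j][i] = ee + oe + eo + oo          # Z5
            z[7 - j][i] = ee - oe + eo - oo      # Z6
            z[j][7 - i] = ee + oe - eo - oo      # Z7
            z[7 - j][7 - i] = ee - oe - eo + oo  # Z8
    return z
```

Unlike the forward transform, the additions and subtractions here come after the four products, so every output element combines one term from each sub-block.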



3.4 Further Simplification of the 4x4 Matrix 
Multiplications 

Eq. (3.25) and Eq. (3.49) show that the core components
of the 2-D DCT and 2-D IDCT algorithms are four matrix
products, where each of them consists of the matrix multiplications
of three 4x4 matrices.

Generalizing the product of three 4x4 matrices, such as
E X++ E^T, E X-+ O^T, O X+- E^T, O X-- O^T, E^T X_ee E, E^T X_oe O,
O^T X_eo E and O^T X_oo O, as:

\[
V_{4\times 4} = B_{4\times 4}\, U_{4\times 4}\, C_{4\times 4}^{T} \tag{3.51}
\]

and the function unit to implement B U C^T is
considered as a sub-block unit. For V=[v_ij], B=[b_ij], U=[u_ij]
and C=[c_ij], the multiplications of the three 4x4 matrices can
be carried out by switching the order of the summations, and combining
the b_ik and c_jl together such that each element of matrix V
can be determined as:

\[
v_{ij} = \sum_{k=0}^{3} b_{ik}\Big(\sum_{l=0}^{3} u_{kl}\, c_{jl}\Big)
       = \sum_{k=0}^{3}\sum_{l=0}^{3} b_{ik}\, u_{kl}\, c_{jl}
       = \sum_{k=0}^{3}\sum_{l=0}^{3} u_{kl}\,(b_{ik}\, c_{jl}) \tag{3.52}
\]

From this equation, one can see that each v_ij, 0 ≤ i, j ≤ 3, is
expressed as a sum of products of u_kl·(b_ik c_jl) for 0 ≤ k, l ≤ 3,
where u_kl is a function of the input sequence X (see Eq.
(3.22) and (3.45) above), and b_ik c_jl is a function of the
coefficient matrix A (see Eq. (3.10) above) and can be
pre-calculated as one of the ±w_m w_n, 1 ≤ m, n ≤ 7. Since each
w_m defined in Eq. (3.2) has 7 possible values, there are a total of
28 different combinations for pre-calculated constants
w_m w_n, 1 ≤ m, n ≤ 7.
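
The reordering in Eq. (3.52) can be illustrated with a short sketch in which the products b_ik·c_jl are formed once, ahead of time, and each output element is then a sum of input values times fixed constants. The names below are hypothetical, and the count of 28 is taken here as the number of unordered index pairs (m, n) with 1 ≤ m ≤ n ≤ 7, which is an interpretation of the text rather than something it states explicitly.

```python
import math
from itertools import combinations_with_replacement

def coefficient_pairs():
    """Unordered index pairs (m, n), 1 <= m <= n <= 7, for the products w_m * w_n."""
    return list(combinations_with_replacement(range(1, 8), 2))

def precompute_constants(b, c):
    """const[i][j][k][l] = b[i][k] * c[j][l]; fixed once for a given sub-block."""
    n = len(b)
    return [[[[b[i][k] * c[j][l] for l in range(n)] for k in range(n)]
             for j in range(n)] for i in range(n)]

def sub_block(u, const):
    """v[i][j] = sum over k, l of u[k][l] * const[i][j][k][l], as in Eq. (3.52)."""
    n = len(u)
    return [[sum(u[k][l] * const[i][j][k][l] for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]

def triple_product(b, u, c):
    """Reference: V = B U C^T evaluated as two ordinary matrix products."""
    n = len(u)
    bu = [[sum(b[i][k] * u[k][l] for k in range(n)) for l in range(n)] for i in range(n)]
    return [[sum(bu[i][l] * c[j][l] for l in range(n)) for j in range(n)] for i in range(n)]
```

In this form every multiplication has one operand that is known at design time, which is the property the following paragraphs rely on.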

Since $b_{ik}c_{jl}$ is a pre-calculated constant, the multiplication
$u_{kl}\cdot(b_{ik}c_{jl})$ fits into the pattern $x\cdot d$, which, in fact, is a
variable multiplied by a constant instead of the multiplication
between two variables as $x\cdot y$. The multiplication
between a variable and a constant can be very easily
implemented by a group of hardwired adders with no need
for a real multiplier.
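
A minimal software sketch of the idea, assuming the constant is stored as an unsigned fixed-point integer: each set bit of the constant contributes one shifted copy of the variable, and the copies are summed. The names `const_multiply` and `adder_count` are hypothetical, and the constant 0xF642 is simply the 16-bit entry for i = j = 1 from Table 4.1; an actual hardwired design may recode the constant to use fewer adders, which is not modeled here.

```python
def const_multiply(x, coeff, frac_bits):
    """Multiply integer x by the constant coeff / 2**frac_bits using only
    shifts and additions. coeff must be a non-negative integer."""
    acc = 0
    shift = 0
    while coeff >> shift:
        if (coeff >> shift) & 1:
            acc += x << shift      # one shifted copy of x per set bit
        shift += 1
    return acc >> frac_bits        # discard the fractional bits (truncation)

def adder_count(coeff):
    """Adders needed to sum the shifted copies in this simple scheme:
    one fewer than the number of set bits."""
    return max(bin(coeff).count("1") - 1, 0)

Q11 = 0xF642                        # about 0.9619 with 16 fractional bits
product = const_multiply(200, Q11, 16)
print(product, adder_count(Q11))
```

Because the shifts are fixed for a given constant, they cost only wiring in hardware; the adders are the only active logic.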

Further review of the multiplication $u_{kl}\cdot(b_{ik}c_{jl})$ as a basic
processing element (PE) of the proposed 2-D DCT/IDCT
algorithm shows that, because the computation of $u_{kl}\cdot(b_{ik}c_{jl})$
is exactly the same as computing $x\cdot d$, this algorithm only
suffers one coefficient quantization loss and one computation
truncation loss when the $u_{kl}$ are used to directly
compute the final 2-D DCT/IDCT output results in Eq.
(3.52) (given there is no truncation loss for all additions and
subtractions in Eq. (3.49) for 2-D IDCT). In contrast, the
row-column decomposition method suffers at least two
coefficient quantization losses and two computation truncation
losses: one occurs when computing the 1-D column
transform and another when computing the 1-D
row transform, and it is prone to both accumulated errors
and error propagation from the first 1-D transform to the second.
Thus, much higher computation accuracy can be
achieved by the proposed algorithm given that the same
finite wordlength is adopted by both approaches.

Since each 4x4 sub-block can be implemented without
multipliers, the proposed algorithms can be implemented
with only adders and subtractors as the basic processing
elements (PE). This results in a great reduction of the
complexity and design cost of hardware implementations.

Besides, each 4x4 sub-block is totally independent from
the other sub-blocks in the proposed 2-D DCT and 2-D IDCT
algorithm. There are no communication interconnections
among the four sub-blocks in either the 2-D DCT or 2-D IDCT
computation, which means that the drawback of complex global
communication interconnection associated with other existing
direct 2-D DCT and 2-D IDCT approaches has been
overcome, reducing the routing complexity of hardware
implementations.

Another advantage brought by the localized interconnection
is that a parallel architecture can be adopted to
implement the four 4x4 sub-blocks independently. A parallel
data in and parallel data out I/O scheme will guarantee that
the system throughput can meet the requirements of current
and future video applications.

3.5 Summary 

In this section, an algorithm to compute the 2-D 8x8 DCT/
IDCT according to the present invention has been presented.
Based on a direct 2-D coefficient matrix factorization
approach, the 8x8 DCT/IDCT can be calculated through
four 4x4 sub-blocks, which are only "half-size" of the
original one.

Further simplification of the core component, the 4x4
sub-block, shows that this direct 2-D approach not only is more
computationally efficient and requires a smaller number of
multiplications, but also has localized interconnection and
can be easily implemented with a parallel structure to
accommodate four independent sub-blocks.

Moreover, each multiplication in this scheme has been
confined to a variable multiplied by a constant instead of two
variables in general. Every multiplication operation can be
very easily fulfilled by a group of hardwired adders and the
whole 2-D DCT/IDCT computation can be carried out by
using only adders and subtractors. The higher computation
accuracy of this scheme means that a shorter finite internal
wordlength can be used in the hardware implementation of
the algorithm while the same accuracy requirements for both
2-D DCT/IDCT can still be met. A shorter internal finite
wordlength means that fewer registers and less
complicated circuits are required for the hardware
implementation.

The simplified processing elements (just adders and
subtractors, no multiplier is required), parallel sub-block
structure, localized interconnection and shorter finite internal
wordlength associated with the proposed 2-D DCT/
IDCT algorithm demonstrate that the proposed algorithm is
a perfect candidate for VLSI implementation.

4.0 FINITE WORDLENGTH SIMULATIONS 

In this section, finite wordlength simulations of a 2-D 
DCT/IDCT algorithm according to an embodiment of the 
present invention are carried out. The simulation results 
show that the algorithm can meet JPEG 2-D IDCT specifi- 
cation with only 16-bit finite internal wordlength for the 
arithmetic operations, which means that all additions, sub- 
tractions and multiplications required in this algorithm use 
no more than 16-bit. 

4.1 Introduction 
In the hardware implementation of any algorithm, there
are tight trade-offs among various quantities like accuracy,
speed, and chip area, etc. For a 2-D DCT or IDCT algorithm,
it is coefficient quantization and finite wordlength truncation
that are the two major factors which decide the accuracy,
speed and chip area.

To represent any cosine coefficient cos(iπ/16), i = 0, 1, . . . ,
15, with finite wordlength introduces coefficient quantization
(or coefficient approximation) errors. Furthermore, the
implementation of any arithmetic operation with finite internal
wordlength arithmetic (due to fixed register length)
introduces truncation (or rounding) errors. To minimize the
effects of quantization errors, more bits are needed to
approximate the cosine coefficients cos(iπ/16), which would
require wider inputs for multipliers. To minimize the effect
of truncation errors, wider registers are required for each
arithmetic operation. Doing so, however, results in a slower
critical path and a larger chip area for each execution unit. In
fact, the optimal coefficient and register width can lead to a
higher speed and a smaller chip area. However, both widths
should be chosen to ensure the minimum accuracy criteria
for 2-D DCT specified by ITU-T Recommendation H.261
and 2-D IDCT specified by the Joint CCITT/ISO committee
(JPEG).

For a 2-D DCT, the final result is computed by using Eq.
(3.52), where all other quantities can be precisely
pre-computed. For a 2-D IDCT, the final result can be computed
by using Eq. (3.52) and (3.49), where both coefficient
quantization errors and finite wordlength truncation errors
are still determined by Eq. (3.52). So the error analysis will
be focused on all the arithmetic operations of Eq. (3.52).

In section 4.2, the coefficient quantization error effects for
the algorithm according to the present invention are investigated.
In section 4.3, the focus is shifted to the effects of
arithmetic operations with different finite internal
wordlengths. One optimal candidate for VLSI
implementation, which combines optimal coefficient quantization
errors and truncation errors, is presented in section
4.4.

4.2 Coefficient Quantization Error Effects 

For the coefficients $w_i = (1/2)\cos(i\pi/16)$ in Eq. (3.2) one can
factor the constant 1/2 out of each $w_i$. By using cos(iπ/16)
instead of $w_i$ in Eq. (3.3) and (3.28), one must scale by four,
which can be overcome by shifting the final results right by
two bits. Let's define new coefficient parameters as:

$$Q_{ij} = 4\,w_i w_j = \cos(i\pi/16)\cos(j\pi/16) \tag{4.1}$$

Since 0 < cos(iπ/16) < 1, the $Q_{ij}$ would still fall into the
range 0 < $Q_{ij}$ < 1, i, j = 0, 1, . . . , 7. In Eq. (4.1), the calculation
of cos(iπ/16)cos(jπ/16) can be precisely pre-computed so
that there is no precision loss. And the quantization error is
greatly reduced as only one approximation error instead of
two approximation errors is associated with each multiplication
unit ($u_{kl}\cdot Q_{ij}$).
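
The difference between one and two approximation errors can be illustrated with a short Python sketch. It assumes round-to-nearest quantization to a given number of fractional bits, which the text does not specify; the function names are hypothetical. The "direct" error quantizes the pre-computed product once, while the "two-step" error quantizes each cosine first and then multiplies.

```python
import math

def quantize(value, bits):
    """Round value to the nearest multiple of 2**-bits (assumed rounding mode)."""
    scale = 1 << bits
    return round(value * scale) / scale

def max_errors(bits):
    """Return (direct, two_step) maximum coefficient errors over i, j = 1..7."""
    direct = 0.0
    two_step = 0.0
    for i in range(1, 8):
        for j in range(1, 8):
            ci = math.cos(i * math.pi / 16)
            cj = math.cos(j * math.pi / 16)
            exact = ci * cj
            # One approximation: quantize the exact product Q_ij once.
            direct = max(direct, abs(quantize(exact, bits) - exact))
            # Two approximations: quantize each cosine, then multiply.
            approx = quantize(ci, bits) * quantize(cj, bits)
            two_step = max(two_step, abs(approx - exact))
    return direct, two_step

for bits in (9, 13, 16):
    d, t = max_errors(bits)
    print(bits, d, t)
```

With round-to-nearest, the direct error can never exceed half of one least significant bit, whereas the two-step error combines the errors of both factors.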

In the rest of section 4.2, the impact of coefficient quantization
errors for the algorithm according to the present
invention is investigated. The impact of truncation errors
caused by finite wordlength can be overcome by using a total
of 31-bit in Eq. (3.52) for both 2-D DCT and 2-D IDCT.

Table 4.1 shows a 16-bit representation for the coefficients
$Q_{ij}$, which is the highest quantization precision used in the
simulation for the proposed algorithm. The maximum quantization
error with 16-bit representation for all $Q_{ij}$ is
0.000007.

TABLE 4.1

16-bit representation of coefficient $Q_{ij}$

(Hex)    j = 1    j = 2    j = 3    j = 4    j = 5    j = 6    j = 7

i = 1    0.F642   0.E7F8   0.D0C4   0.B18B   0.8B7E   0.6016   0.30FC
i = 2             0.DA83   0.C4A7   0.A73D   0.8366   0.5A82   0.2E24
i = 3                      0.B0FC   0.9683   0.7642   0.5175   0.2987
i = 4                               0.8000   0.6492   0.4546   0.2351
i = 5                                        0.4F04   0.366D   0.1BBF
i = 6                                                 0.257E   0.131D
i = 7                                                          0.09BE
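
The hexadecimal entries of a 16-bit coefficient table can be generated from Eq. (4.1) with a few lines of Python. This is a sketch under the assumption that each $Q_{ij}$ is rounded to the nearest 16-bit fraction; the helper names are hypothetical and the code is not taken from the original simulation.

```python
import math

def q_coefficient(i, j):
    """Q_ij = cos(i*pi/16) * cos(j*pi/16), as in Eq. (4.1)."""
    return math.cos(i * math.pi / 16) * math.cos(j * math.pi / 16)

def to_hex_fraction(value, bits=16):
    """Round a value in (0, 1) to `bits` fractional bits and format it as a
    hexadecimal fraction (assumes round-to-nearest)."""
    code = round(value * (1 << bits))
    return "0.%04X" % code

# Only the upper triangle is needed because Q_ij = Q_ji.
table = {(i, j): to_hex_fraction(q_coefficient(i, j))
         for i in range(1, 8) for j in range(i, 8)}

# Largest difference between a rounded entry and the exact coefficient.
max_error = max(abs(int(text[2:], 16) / 65536 - q_coefficient(i, j))
                for (i, j), text in table.items())

for i in range(1, 8):
    print("i = %d" % i, " ".join(table[(i, j)] for j in range(i, 8)))
print("max error: %.7f" % max_error)
```

The symmetry $Q_{ij} = Q_{ji}$ is why only 28 distinct constants have to be stored.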



4.2.1 2-D DCT Simulation Results

The simulation of quantization errors for an example of a
2-D DCT algorithm according to the present invention is
carried out on 10,000 sets of 8x8 input data. Each input data
is randomly generated within the range of -256 to 255. The
final 2-D DCT outputs are rounded to 12-bit integers.

The accuracy requirements for the 2-D DCT simulations
are adopted from the H.261 Specification. Each of the 8x8
DCT output pixels should be in compliance with the specification
for parameters like Peak Pixel Error, Peak Pixel
Mean Square Error, Overall Mean Square Error, Peak Pixel
Mean Error and Overall Mean Error for each of the 10,000
block data sets generated above. The reference data used in
the statistical calculation are generated by the formula in Eq.
(2.32). Additionally, the error of the DC component is analyzed
since it is the most important parameter for 2-D DCT. The
simulation results and accuracy requirements of H.261 for
2-D DCT are shown in Table 4.2.



TABLE 4.2

Coefficient quantization effects for 2-D DCT

Quantization                     Peak Pixel     Overall        Peak Pixel   Overall      Fixed-point
Length of        Peak Pixel      Mean Square    Mean Square    Mean         Mean         DC
$Q_{ij}$         Error           Error          Error          Error        Error        Error

H.261 Spec       ≤1              ≤0.06          ≤0.02          ≤0.015       ≤0.0015
8-bit            1.3425          0.073803       0.025681       0.005270     0.000109     0
9-bit            0.8204          0.039675       0.011384       0.003949     0.000024     0
10-bit           0.5130          0.010145       0.003495       0.001760     0.000125     0
11-bit           0.2457          0.002973       0.001227       0.001084     0.000080     0
12-bit           0.1137          0.000653       0.000445       0.000690     0.000014     0
13-bit           0.0563          0.000156       0.000116       0.000297     0.000025     0
14-bit           0.0400          0.000068       0.000041       0.000155     0.000003     0
15-bit           0.0221          0.000028       0.000011       0.000092     0.000013     0
16-bit           0.0084          0.000005       0.000002       0.000031     0.000001     0

From the above table, one can see that the computation
accuracy of the proposed 2-D DCT algorithm drops gradually
when the coefficient representations are reduced from
16-bit to 8-bit. And at least a 9-bit coefficient representation
for $Q_{ij}$, which is equivalent to a 4.5-bit coefficient
representation for each cos(iπ/16), is required in order to
meet the 2-D DCT accuracy requirements of H.261. The
graphic expressions of the simulation results are also illustrated
in FIG. 9.

4.2.2 2-D IDCT Simulation Results

In contrast with the 2-D DCT, 2-D IDCT simulations of the
proposed algorithm need to be carried out in a much more
complicated way, which can be summarized as [JPEG,
RY90]:

(1) Generate random integer data values in the range -L
to +H. 10,000 block sets of 8x8 input data should be
generated for (L=300, H=300), (L=256, H=255) and
(L=5, H=5), each;

(2) For each 8x8 input data, 2-D DCT is performed with
at least 64-bit floating point accuracy;

(3) For each block, the 8x8 transformed results are
rounded to the nearest integer values and clipped to the
range -2048 to +2047;

(4) The "reference" 2-D IDCT output results are computed
with at least 64-bit floating point accuracy, in
which the input data are the data generated in step (3),
the output data are clipped to the range -256 to +255;

(5) The proposed 2-D IDCT algorithm ("test") is used to
compute 2-D IDCT output data with the same input
data generated in step (3);

(6) For each of the 8x8 IDCT output pixels in the 10,000 block
sets, measure the peak, mean, and mean square errors
between the "reference" and "test" data.

The simulations of quantization error effects for the proposed
2-D IDCT algorithm are also carried out on 10,000 sets of
randomly generated 8x8 input blocks. The 2-D IDCT output
data are rounded to 9-bit integers through saturation control
with ±0.5 adjustment (based on the ± sign of each number).
Several important parameters, such as Peak Pixel Error,
Peak Pixel Mean Square Error, Overall Mean Square Error,
Peak Pixel Mean Error and Overall Mean Error, are summarized
in Table 4.3 for the total 10,000 sets simulation. The
results calculated with the 2-D IDCT formula in Eq. (2.33) have
been used as "reference", and the input data range is from
-256 to +255.



From Table 4.3, a conclusion similar to that for the proposed 2-D
DCT algorithm can be reached: the computation accuracy of
the proposed 2-D IDCT algorithm drops gradually when the
coefficient quantization precision is reduced from 16-bit to
8-bit. And at least a 9-bit coefficient representation for $Q_{ij}$,
which is equivalent to a 4.5-bit coefficient representation
for each cos(iπ/16), is required in order to meet the 2-D
IDCT accuracy requirements of JPEG. The specification
and the results for all required input data ranges are illustrated
in Table 4.4. The graphic expressions of the simulation
results are also illustrated in FIG. 10.

TABLE 4.3

Coefficient quantization effects for 2-D IDCT

Quantization                     Peak Pixel     Overall
Length of        Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
$Q_{ij}$         Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
8-bit            1.7694          0.156273       0.151710       0.005913      0.000003
9-bit            0.4973          0.011707       0.011359       0.002877      0.000018
10-bit           0.2869          0.003702       0.003574       0.001670      0.000007
11-bit           0.1653          0.001346       0.001303       0.000811      0.000007
12-bit           0.1126          0.000545       0.000523       0.000600      0.000007
13-bit           0.0651          0.000204       0.000197       0.000360      0.000004
14-bit           0.0480          0.000127       0.000123       0.000256      0.000013
15-bit           0.0369          0.000095       0.000093       0.000228      0.000016
16-bit           0.0320          0.000086       0.000083       0.000261      0.000006



TABLE 4.4

Simulation results for 2-D IDCT with 9-bit $Q_{ij}$ quantization

Input Data                       Peak Pixel     Overall
Range            Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
(-L to +H)       Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
L = 256, H = 255 0.4973          0.011707       0.011359       0.012877      0.000018
L = 300, H = 300 0.5642          0.016135       0.015625       0.012206      0.000010
L = 5, H = 5     0.0449          0.000236       0.000108       0.011574      0.000005
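
The six-step test procedure of section 4.2.2 can be sketched in Python as follows. This is a hedged illustration, not the simulation code behind the tables: the "test" transform here is a stand-in that simply quantizes an ordinary 8-point coefficient matrix (it is not the direct 2-D algorithm of the present invention), the statistic definitions are an assumed reading of the parameter names, and all function names are hypothetical.

```python
import math
import random

N = 8
# Orthonormal 8-point DCT-II matrix: A[k][n] = c(k)/2 * cos((2n+1)k*pi/16).
A = [[(math.sqrt(0.5) if k == 0 else 1.0) * 0.5
      * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(N)]
     for k in range(N)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(P):
    return [list(row) for row in zip(*P)]

def clip_round(block, lo, hi):
    """Round to the nearest integer, then clip to [lo, hi]."""
    return [[min(hi, max(lo, round(v))) for v in row] for row in block]

def dct2(x):
    return matmul(matmul(A, x), transpose(A))

def reference_idct(X):
    """Step (4): floating-point IDCT, output clipped to -256..+255."""
    return clip_round(matmul(matmul(transpose(A), X), A), -256, 255)

def make_quantized_idct(bits):
    """Stand-in "test" IDCT using coefficients rounded to `bits` bits."""
    scale = 1 << bits
    Aq = [[round(a * scale) / scale for a in row] for row in A]
    return lambda X: clip_round(matmul(matmul(transpose(Aq), X), Aq), -256, 255)

def run_test(test_idct, L, H, blocks, seed=1):
    rng = random.Random(seed)
    err = [[0.0] * N for _ in range(N)]   # per-pixel error sums
    sq = [[0.0] * N for _ in range(N)]    # per-pixel squared-error sums
    peak = 0.0
    for _ in range(blocks):
        x = [[rng.randint(-L, H) for _ in range(N)] for _ in range(N)]  # (1)
        X = clip_round(dct2(x), -2048, 2047)                            # (2), (3)
        ref, tst = reference_idct(X), test_idct(X)                      # (4), (5)
        for i in range(N):                                              # (6)
            for j in range(N):
                e = tst[i][j] - ref[i][j]
                peak = max(peak, abs(e))
                err[i][j] += e
                sq[i][j] += e * e
    cells = [(i, j) for i in range(N) for j in range(N)]
    return {
        "peak_pixel_error": peak,
        "peak_pixel_mse": max(sq[i][j] / blocks for i, j in cells),
        "overall_mse": sum(sq[i][j] for i, j in cells) / (64 * blocks),
        "peak_pixel_mean_error": max(abs(err[i][j]) / blocks for i, j in cells),
        "overall_mean_error": abs(sum(err[i][j] for i, j in cells)) / (64 * blocks),
    }

stats = run_test(make_quantized_idct(16), L=256, H=255, blocks=20)
print(stats)
```

Only 20 blocks are used to keep the pure-Python run short; the procedure described above calls for 10,000 blocks per input range.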



4.3 Truncation Error Effects

In order to determine the truncation errors for an example
of an algorithm according to the present invention, different
finite internal wordlengths are used in all arithmetic operations
whereas the $Q_{ij}$ coefficient quantization is kept at a fixed
16-bit precision. For both the 2-D DCT and 2-D IDCT, the
corresponding maximum finite wordlengths used in Eq.
(3.52) are 30-bit.

4.3.1 2-D DCT Simulation Results

The finite wordlength simulations for the proposed 2-D
DCT algorithm are carried out on 10,000 sets of randomly
generated (in the range of -256 to 255) 8x8 input data. All the
parameters used in this section are the same ones used in
section 4.2.1. The simulation results and accuracy requirements
of H.261 for 2-D DCT are shown in Table 4.5.

TABLE 4.5

Finite wordlength truncation effects for 2-D DCT

                                 Peak Pixel     Overall        Peak Pixel   Overall      Fixed-point
Finite           Peak Pixel      Mean Square    Mean Square    Mean         Mean         DC
Wordlengths      Error           Error          Error          Error        Error        Error

H.261 Spec       ≤1              ≤0.06          ≤0.02          ≤0.015       ≤0.0015
14-bit           1.8750          1.059886       0.097640       0.998788     0.15678      0
15-bit           0.6257          0.021530       0.019566       0.004038     0.000070     0
16-bit           0.3233          0.005382       0.004862       0.001518     0.000078     0
17-bit           0.1593          0.001347       0.001221       0.001022     0.000018     0
18-bit           0.0802          0.000338       0.000307       0.000476     0.000018     0
20-bit           0.0214          0.000025       0.000021       0.000127     0.000001     0
22-bit           0.0092          0.000006       0.000003       0.000089     0.000001     0
24-bit           0.0085          0.000005       0.000002       0.000081     0.000001     0
26-bit           0.0084          0.000005       0.000002       0.000085     0.000001     0
28-bit           0.0084          0.000005       0.000002       0.000062     0.000001     0
30-bit           0.0084          0.000005       0.000002       0.000031     0.000001     0



The simulation results clearly show that truncation errors
do not have much effect until the finite wordlengths are
reduced to less than 20-bit. Take the parameter Peak Pixel Error
for example: it has an inverse linear relation with the finite
wordlengths when they are equal to or less than 20-bit, but it
hardly changes when the finite wordlengths are more than
20-bit. And at least a 15-bit internal wordlength is required in
order to meet the 2-D DCT accuracy requirements of H.261.
The graphic expressions of the simulation results are also
illustrated in FIG. 4.3.
4.3.2 2-D IDCT Simulation Results

The finite wordlength simulations of the proposed 2-D
IDCT algorithm are also carried out with 10,000 sets of
randomly generated 8x8 input blocks. All the parameters
used in this section are the same ones used in section 4.2.2.
The input data range is from -256 to +255. The same
"reference" and "test" results are used to calculate the error
statistics. The simulation results and accuracy requirements
of JPEG for 2-D IDCT are shown in Table 4.6.

TABLE 4.6

Finite wordlength truncation effects for 2-D IDCT

                                 Peak Pixel     Overall
Finite           Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
Wordlengths      Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
15-bit           1.2864          0.080420       0.078200       0.008168      0.000000
16-bit           0.6514          0.020007       0.019454       0.002817      0.000000
17-bit           0.3277          0.005132       0.004934       0.001865      0.000000
18-bit           0.1529          0.001261       0.001217       0.000626      0.000000
20-bit           0.0607          0.000175       0.000169       0.000406      0.000009
22-bit           0.0363          0.000091       0.000089       0.000341      0.000001
24-bit           0.0323          0.000086       0.000084       0.000329      0.000007
26-bit           0.0320          0.000086       0.000083       0.000319      0.000005
28-bit           0.0320          0.000086       0.000083       0.000229      0.000005
30-bit           0.0320          0.000086       0.000083       0.000261      0.000006

From Table 4.6, a similar conclusion as for the 2-D DCT
algorithm can be reached: the computational accuracies of
the proposed 2-D IDCT algorithm drop in proportion to the
finite wordlengths when they are equal to or less than 20-bit,
which is illustrated in FIG. 4.4. When the coefficient
quantization precision is 16-bit, all the arithmetic operations
for the proposed 2-D IDCT can have no more than 16-bit finite
wordlength and the output results can still meet the JPEG 8x8 IDCT
specification for all required input data ranges, which is
illustrated in Table 4.7.

TABLE 4.7

Simulation results for 2-D IDCT with 16-bit precision

Input Data                       Peak Pixel     Overall
Range            Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
(-L to +H)       Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
L = 256, H = 255 0.6514          0.020007       0.019454       0.002817      0.000000
L = 300, H = 300 0.6871          0.019927       0.019489       0.003108      0.000000
L = 5, H = 5     0.6472          0.019110       0.018542       0.004196      0.000000

4.4 Combined Quantization and Truncation Error
Effects

The simulations carried out in sections 4.2 and 4.3 show
that the coefficient quantization and finite wordlength truncation
have different impacts on the accuracy of the proposed
algorithm. It can be seen in section 4.3 that arithmetic
operations with a 16-bit finite internal wordlength would keep
the truncation errors low enough to meet both the H.261 and
JPEG requirements, as long as the coefficient quantization
errors are relatively small. A smaller finite wordlength means
the complexity of the hardware implementation can be reduced,
which would result in cost reductions for both chip
design and implementation. It can be seen from section 4.2
that the minimum coefficient representation for $Q_{ij}$ is 9-bit.
The fewer bits used to quantize the coefficients $Q_{ij}$, the smaller
the number of hardwired adders required for the
pseudo-multipliers.

Since the smallest finite wordlength possible for the
proposed algorithm is 16-bit, the optimal combination
should be X-bit quantization precision plus 16-bit finite
wordlength for all arithmetic operations. The reductions of
coefficient quantization precision should still guarantee that the
proposed algorithm meets both the H.261 and JPEG
accuracy requirements.



Further simulation results, which are illustrated in Tables
4.8 and 4.9, show that the representation of coefficients $Q_{ij}$
with 13-bit precision is the lowest possible quantization
precision required by the proposed algorithm to meet both
the 2-D DCT and 2-D IDCT accuracy requirements while
the finite internal wordlength is kept at 16-bit.

TABLE 4.8

Simulation results for 2-D DCT with 16-bit finite wordlength

Quantization                     Peak Pixel     Overall        Peak Pixel   Overall      Fixed-point
Length of        Peak Pixel      Mean Square    Mean Square    Mean         Mean         DC
$Q_{ij}$         Error           Error          Error          Error        Error        Error

H.261 Spec       ≤1              ≤0.06          ≤0.02          ≤0.015       ≤0.0015
12-bit           0.7834          0.396600       0.002001       0.011940     0.002814     0
13-bit           0.3379          0.005519       0.004998       0.003446     0.000211     0
16-bit           0.3233          0.005282       0.004862       0.001518     0.000078     0



TABLE 4.9

Simulation results for 2-D IDCT with 16-bit finite wordlength

Quantization                     Peak Pixel     Overall
Length of        Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
$Q_{ij}$         Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
12-bit           2.3786          4.091672       0.058046       0.033540      0.004605
13-bit           0.6514          0.020303       0.019550       0.004562      0.000000
16-bit           0.6513          0.019974       0.019430       0.004544      0.000000



The 13-bit representations for coefficients $Q_{ij}$ are shown
in Table 4.10. The maximum quantization error with 13-bit
precision for all $Q_{ij}$ is 0.000065, which corresponds to
quantizing each single cos(iπ/16) with a maximum quantization
error of 0.00806.

TABLE 4.10

13-bit representation of coefficient $Q_{ij}$

(Hex)    j = 1    j = 2    j = 3    j = 4    j = 5    j = 6    j = 7

i = 1    0.F640   0.E7F8   0.D0C0   0.B190   0.8B80   0.6010   0.3100
i = 2             0.DA80   0.C4A8   0.A740   0.8368   0.5A80   0.2E20
i = 3                      0.B100   0.9680   0.7640   0.5170   0.2988
i = 4                               0.8000   0.6490   0.4548   0.2350
i = 5                                        0.4F00   0.3670   0.1BC0
i = 6                                                 0.2580   0.1320
i = 7                                                          0.09C0

The simulations with 13-bit quantization precision and
different finite wordlengths for the proposed 2-D DCT algorithm
are carried out on 10,000 sets of randomly generated
(in the range of -256 to 255) 8x8 input data. The simulation
results are shown in Table 4.11.

TABLE 4.11

13-bit quantization & X-bit finite wordlength for 2-D DCT

                                 Peak Pixel     Overall        Peak Pixel   Overall      Fixed-point
Finite           Peak Pixel      Mean Square    Mean Square    Mean         Mean         DC
Wordlengths      Error           Error          Error          Error        Error        Error

H.261 Spec       ≤1              ≤0.06          ≤0.02          ≤0.015       ≤0.0015
14-bit           1.8750          1.059886       0.097852       0.998788     0.015908     0
15-bit           0.6757          0.021735       0.019701       0.003944     0.000253     0
16-bit           0.3379          0.005519       0.004998       0.003446     0.000211     0
17-bit           0.1679          0.001949       0.001337       0.003547     0.000179     0
18-bit           0.1011          0.000507       0.000472       0.003538     0.000171     0
19-bit           0.0894          0.000354       0.000279       0.003512     0.000172     0
20-bit           0.0594          0.000193       0.000140       0.003617     0.000183     0

The simulations with 13-bit quantization precision and
different finite wordlengths for the proposed 2-D IDCT algorithm
are also carried out on 10,000 sets of randomly
generated (in the range of -256 to 255) 8x8 input data. The
simulation results are shown in Table 4.12.

TABLE 4.12

13-bit quantization & X-bit finite wordlength for 2-D IDCT

                                 Peak Pixel     Overall
Finite           Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
Wordlengths      Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
15-bit           1.3027          0.080976       0.078392       0.008551      0.000000
16-bit           0.6514          0.020303       0.019550       0.004562      0.000000
17-bit           0.3475          0.005224       0.005059       0.007610      0.000000
18-bit           0.1847          0.001385       0.001342       0.006544      0.000000
19-bit           0.1504          0.000567       0.000530       0.006323      0.000005
20-bit           0.0845          0.000324       0.000283       0.006498      0.000018

With 13-bit quantization precision for $Q_{ij}$ (which is
equivalent to quantizing each $\cos\frac{(2n+1)k\pi}{16}$
with 6.5-bit) and a 16-bit finite internal wordlength, the proposed
2-D IDCT algorithm can still meet JPEG's IDCT
Specification for all required input data ranges, as illustrated
in Table 4.13.



TABLE 4.13

13-bit quantization and 16-bit finite wordlength for 2-D IDCT

Input Data                       Peak Pixel     Overall
Range            Peak Pixel      Mean Square    Mean Square    Peak Pixel    Overall
(-L to +H)       Error           Error          Error          Mean Error    Mean Error

JPEG Spec        ≤1              ≤0.06          ≤0.02          ≤0.015        ≤0.0015
L = 256, H = 255 0.6514          0.020303       0.019550       0.004562      0.000000
L = 300, H = 300 0.7060          0.020230       0.019642       0.005796      0.000000
L = 5, H = 5     0.6472          0.019110       0.018542       0.004196      0.000000



4.5 Comparison with Row-Column Method 

It is relatively difficult to analyze the quantization and
truncation effects mathematically. But one can compare the
error upper bounds of an example algorithm according to the
present invention and a generic row-column decomposition
2-D DCT/IDCT implementation.

Take the 2-D DCT for example. Let's define the maximum
quantization error for cos(iπ/16) as $\Delta_q$ and the truncation
error for any arithmetic operation as $\Delta_t$, respectively. Since
the core component of the proposed algorithm is:

$$v_{ij} = \sum_{k=0}^{3}\sum_{l=0}^{3} u_{kl}\,(b_{ik}c_{jl}) \tag{4.2}$$

for an input data range of $-256 \le x \le 255$, the output error is
bounded by

$$\Delta_z \le \left(4\times256\times(\Delta_q)^2 + \Delta_t\right)\times16 = 2^{14}(\Delta_q)^2 + 2^{4}\Delta_t \tag{4.3}$$

when Eq. (3.22) is used to compute $u_{kl}$.

For the generic row-column method, the 2-D DCT is decomposed
into two 1-D DCTs as

$$y_{ij} = \sum_{k=0}^{7} a_{ik}\,x_{kj}, \qquad z_{ij} = \sum_{k=0}^{7} y_{ik}\,a_{jk} \tag{4.4}$$

By defining the truncation errors associated with any
arithmetic operation in the two 1-D DCTs as $\Delta_{ty}$ and $\Delta_{tz}$,
respectively, the output errors of these two 1-D DCTs are
bounded by

$$\Delta_y \le (256\times\Delta_q + \Delta_{ty})\times8 = 2^{11}\Delta_q + 2^{3}\Delta_{ty}, \text{ and}$$

$$\Delta_z \le (\max(y)\times\Delta_q + \Delta_{tz})\times8 + \Delta_y\times8 \tag{4.5}$$

where the second part of the error in $\Delta_z$ is propagated from
the first 1-D DCT computation. Since the output value of the
first 1-D DCT is bounded by $\max(y) = 2\sqrt{2}\,\max(x)$, Eq.
(4.5) can then be further extended as

$$\begin{aligned}
\Delta_z &\le (2\sqrt{2}\times256\times\Delta_q + \Delta_{tz})\times8 + \Delta_y\times8 \\
&= \frac{2^{13}}{\sqrt{2}}\Delta_q + 2^{3}\Delta_{tz} + 2^{14}\Delta_q + 2^{6}\Delta_{ty} \\
&= \left(1 + 1/\sqrt{8}\right)\cdot 2^{14}\Delta_q + 2^{3}\Delta_{tz} + 2^{6}\Delta_{ty}
\end{aligned} \tag{4.6}$$

It can be seen from Eq. (4.3) and (4.6) that the example
algorithm of the present invention only requires one half of
the coefficient quantization precision and 2 bits less internal
wordlength than required by the generic row-column method to
achieve the same 2-D DCT accuracy.
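
The two upper bounds can be compared numerically. The sketch below is an illustration under stated assumptions: the direct bound is taken as $2^{14}\Delta_q^2 + 2^4\Delta_t$, and the row-column bound is built from the two 1-D stages with max(y) = 2√2 × 256, as read from this section; the function names and the sample bit widths are hypothetical.

```python
import math

def direct_bound(dq, dt):
    """Assumed reading of the direct 2-D bound: 2^14 * dq^2 + 2^4 * dt."""
    return (4 * 256 * dq ** 2 + dt) * 16

def row_column_bound(dq, dty, dtz):
    """Row-column bound built from the two 1-D stages."""
    dy = (256 * dq + dty) * 8                 # error after the first 1-D DCT
    max_y = 2 * math.sqrt(2) * 256            # largest first-stage output
    return (max_y * dq + dtz) * 8 + dy * 8    # second stage plus propagated error

# Row-column method: 16-bit cosine quantization, 16 fractional bits kept.
rc = row_column_bound(2.0 ** -16, 2.0 ** -16, 2.0 ** -16)

# Direct method: half the coefficient precision (8 bits per cosine) and
# 2 fewer fractional bits in the arithmetic.
direct = direct_bound(2.0 ** -8, 2.0 ** -14)

print(direct, rc)
```

In this example the direct bound stays below the row-column bound even with half the cosine precision and a 2-bit shorter wordlength, which is the relationship the comparison above describes.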
Compared with some existing row-column decomposition
algorithms, which are all optimized to just meet the accuracy
requirements of the H.261 and JPEG specifications, the example
of the present invention requires the shortest quantization
bits and internal operation wordlength. The comparison
result is illustrated in Table 4.14.

TABLE 4.14

Comparison of finite wordlength in different algorithms

Algorithm        Function          Max Cosine      Max Internal
Proposed         Implementation    Quantization    Wordlength

[Ura92]          DCT/IDCT          16-bit          34-bit
[CW95]           DCT               12-bit          24-bit
[CW95]           IDCT              13-bit          27-bit
[MW95]           DCT/IDCT          13-bit          22-bit
Proposed         DCT               6.5-bit         15-bit
Proposed         IDCT              6.5-bit         16-bit

4.6 Summary

In this section, the impacts of coefficient quantization
errors and truncation errors for an example 2-D DCT/IDCT
algorithm according to the present invention have been
examined. The computational accuracy decreases when the
coefficient precision is reduced or the operation precision is
reduced. But excessive coefficient quantization precision or
finite wordlength will not result in much accuracy
improvement.

For hardware implementations, a small finite wordlength
and fewer bits of coefficient quantization are desired. Less
coefficient precision means fewer additions or subtractions
are needed to carry out the coefficient multiplications. A smaller
internal wordlength means fewer bits in registers and simpler
circuits are required to implement the example algorithm
according to the present invention.

Compared with the row-column approach of 2-D DCT/IDCT
implementations [Jan94, MW95, SL96 and WC95, etc.], the
16-bit finite wordlength implementation of the example
algorithm clearly shows that the example algorithm has a
precision advantage, since it possesses only one quantization
error and one truncation error instead of two. The
benefit of suffering only one coefficient quantization loss
comes directly from the example algorithm, where a direct
2-D approach is used instead of 2-D row-column decomposition
methods. This advantage has been verified through
extensive simulations with different quantization and rounding
precisions. Furthermore, this high accuracy can be
achieved through straightforward computations and no
artificial bias adjustments are required (as in [MW95]).

For one example implementation, a 13-bit coefficient
quantization and 16-bit finite internal wordlength was
chosen, which makes this example algorithm a perfect
candidate for VLSI hardware implementations. The 16-bit
finite wordlength is the shortest one adopted by any 2-D
DCT/IDCT chip implementation so far [Jan94, MW95,
SL96, WC95 and Lee97]. If only the 2-D DCT implementation
is desired, only a 15-bit finite internal wordlength is
required by the example algorithm of the present invention.

5.0 Hardware Architecture Design 

In this section, a hardware architecture for a 2-D
DCT/IDCT algorithm according to one embodiment of the
present invention is described. The general structure of
each basic hardware component required by the implementation
is discussed. It is shown in this section that all the
major execution units under this scheme are 100% sharable
between the 2-D DCT and 2-D IDCT operations.

5.1 Introduction 

From the example algorithm described in Section 3, one
can see that the 2-D DCT can be computed directly through
Eq. (3.14), (3.22), (3.52) and (3.25), while the 2-D IDCT can
be computed through Eq. (3.14), (3.45), (3.52) and (3.49).
All the other equations in Section 3 are intermediate results.

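If the 8x8 input data for both the DCT and IDCT is
generalized as

$$
X = \begin{bmatrix}
x_{00} & x_{01} & \cdots & x_{07} \\
\vdots & \vdots &        & \vdots \\
x_{70} & x_{71} & \cdots & x_{77}
\end{bmatrix} \tag{5.1}
$$
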
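then Eq. (3.22) can be extended as

$$
\begin{aligned}
X_{++} &= (X_1 + \tilde{I}_4 X_3)^T + \tilde{I}_4 (X_2 + \tilde{I}_4 X_4)^T \\
&= \begin{bmatrix}
x_{00}+x_{70}+x_{07}+x_{77} & x_{10}+x_{60}+x_{17}+x_{67} & \cdots & x_{30}+x_{40}+x_{37}+x_{47} \\
x_{01}+x_{71}+x_{06}+x_{76} & x_{11}+x_{61}+x_{16}+x_{66} & \cdots & x_{31}+x_{41}+x_{36}+x_{46} \\
\vdots & \vdots & & \vdots \\
x_{03}+x_{73}+x_{04}+x_{74} & x_{13}+x_{63}+x_{14}+x_{64} & \cdots & x_{33}+x_{43}+x_{34}+x_{44}
\end{bmatrix} \\[4pt]
X_{-+} &= (X_1 - \tilde{I}_4 X_3)^T + \tilde{I}_4 (X_2 - \tilde{I}_4 X_4)^T \\
&= \begin{bmatrix}
x_{00}-x_{70}+x_{07}-x_{77} & x_{10}-x_{60}+x_{17}-x_{67} & \cdots & x_{30}-x_{40}+x_{37}-x_{47} \\
x_{01}-x_{71}+x_{06}-x_{76} & x_{11}-x_{61}+x_{16}-x_{66} & \cdots & x_{31}-x_{41}+x_{36}-x_{46} \\
\vdots & \vdots & & \vdots \\
x_{03}-x_{73}+x_{04}-x_{74} & x_{13}-x_{63}+x_{14}-x_{64} & \cdots & x_{33}-x_{43}+x_{34}-x_{44}
\end{bmatrix} \\[4pt]
X_{+-} &= (X_1 + \tilde{I}_4 X_3)^T - \tilde{I}_4 (X_2 + \tilde{I}_4 X_4)^T \\
&= \begin{bmatrix}
x_{00}+x_{70}-x_{07}-x_{77} & x_{10}+x_{60}-x_{17}-x_{67} & \cdots & x_{30}+x_{40}-x_{37}-x_{47} \\
x_{01}+x_{71}-x_{06}-x_{76} & x_{11}+x_{61}-x_{16}-x_{66} & \cdots & x_{31}+x_{41}-x_{36}-x_{46} \\
\vdots & \vdots & & \vdots \\
x_{03}+x_{73}-x_{04}-x_{74} & x_{13}+x_{63}-x_{14}-x_{64} & \cdots & x_{33}+x_{43}-x_{34}-x_{44}
\end{bmatrix} \\[4pt]
X_{--} &= (X_1 - \tilde{I}_4 X_3)^T - \tilde{I}_4 (X_2 - \tilde{I}_4 X_4)^T \\
&= \begin{bmatrix}
x_{00}-x_{70}-x_{07}+x_{77} & x_{10}-x_{60}-x_{17}+x_{67} & \cdots & x_{30}-x_{40}-x_{37}+x_{47} \\
x_{01}-x_{71}-x_{06}+x_{76} & x_{11}-x_{61}-x_{16}+x_{66} & \cdots & x_{31}-x_{41}-x_{36}+x_{46} \\
\vdots & \vdots & & \vdots \\
x_{03}-x_{73}-x_{04}+x_{74} & x_{13}-x_{63}-x_{14}+x_{64} & \cdots & x_{33}-x_{43}-x_{34}+x_{44}
\end{bmatrix}
\end{aligned} \tag{5.2}
$$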
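One can see from Eq. (5.2) that the elements of $X_{++}$, $X_{-+}$,
$X_{+-}$ and $X_{--}$ can be calculated from the input data as

$$
\begin{aligned}
x^{++}_{ij} &= x_{j,i} + x_{7-j,i} + x_{j,7-i} + x_{7-j,7-i}, \\
x^{-+}_{ij} &= x_{j,i} - x_{7-j,i} + x_{j,7-i} - x_{7-j,7-i}, \\
x^{+-}_{ij} &= x_{j,i} + x_{7-j,i} - x_{j,7-i} - x_{7-j,7-i}, \\
x^{--}_{ij} &= x_{j,i} - x_{7-j,i} - x_{j,7-i} + x_{7-j,7-i},
\end{aligned} \tag{5.3}
$$

for $0 \le i, j \le 3$. By substituting $X_{++}$, $X_{-+}$, $X_{+-}$ and $X_{--}$ into Eq.
(3.25), the 2-D DCT results can then be easily calculated
through four 4x4 matrix multiplications: $E X_{++} E^T$, $E X_{-+} O^T$,
$O X_{+-} E^T$ and $O X_{--} O^T$.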
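
The element-wise relation in Eq. (5.3) can be sketched in a few lines of Python. This is only an illustrative model, not part of the described hardware: the function name `shuffle_block` and the nested-list representation of the 8x8 block are assumptions made for the example.

```python
def shuffle_block(x):
    """Form the four 4x4 matrices X++, X-+, X+-, X-- from an 8x8 block x.

    Element (i, j) of each result combines x[j][i] with its three mirror
    images x[7-j][i], x[j][7-i] and x[7-j][7-i], following Eq. (5.3).
    """
    xpp = [[0] * 4 for _ in range(4)]
    xmp = [[0] * 4 for _ in range(4)]
    xpm = [[0] * 4 for _ in range(4)]
    xmm = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            a, b = x[j][i], x[7 - j][i]
            c, d = x[j][7 - i], x[7 - j][7 - i]
            xpp[i][j] = a + b + c + d
            xmp[i][j] = a - b + c - d
            xpm[i][j] = a + b - c - d
            xmm[i][j] = a - b - c + d
    return xpp, xmp, xpm, xmm
```

Summing the four outputs at the same position gives four times the original sample `x[j][i]`, which is a quick sanity check that no input information is lost by the shuffling step.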

For the 2-D IDCT computation, if matrix X is used as an
8x8 input block and matrix Z as the output, Eq. (5.1) can also
be substituted into Eq. (3.45) for the 2-D IDCT computation as



(5-4) 
matrix multiplications unit (or sub-block operator). In order
to carry out four 4x4 matrix multiplications independently
and concurrently, four sub-block operators 1320, 1330,
1340, and 1350 are required by the algorithm in this embodiment.
The shuffling device (or shuffler) 1310 is used by the
2-D DCT operation to compute the elements of
matrices $X_{++}$, $X_{-+}$, $X_{+-}$ and $X_{--}$ in Eq. (5.3). It can also be
used by the 2-D IDCT operation to calculate the final output
$z_{ij}$ in Eq. (5.7) after the matrix multiplications $E^T X_{ee} E$,
$E^T X_{oe} O$, $O^T X_{eo} E$ and $O^T X_{oo} O$ are computed. The general data
flow and simplified architectures 1300, 1400 of the above two
components are illustrated in FIGS. 13 and 14, where the
output data rate from each sub-block would be one sample
per clock cycle for the 2-D IDCT.

All the architectural considerations in this section serve
one purpose: to develop a pipelined parallel architecture for
the algorithm according to one embodiment of the present
invention. The detailed structures of the key components,
the shuffler and the sub-block operator, are described in sections 5.2 and 5.3.
After that, some important auxiliary components are briefly
discussed in section 5.4. In section 5.5, the general structures
of the 2-D DCT, 2-D IDCT and combined 2-D DCT/IDCT
example implementations according to the present invention
are derived. A summary of the hardware architectural design is
presented in section 5.6.

5.2 Shuffler: Addition/Subtraction Shuffling Device

In Eq. (5.3) and Eq. (5.5), each element on the left side of
the equations can be computed by shuffled additions and
subtractions of 4 input data. If the input data are generalized
as "a", "b", "c" and "d", then the 4 outputs would be

$$
\begin{aligned}
x_{++} &= a + b + c + d, \\
x_{-+} &= a - b + c - d, \\
x_{+-} &= a + b - c - d, \\
x_{--} &= a - b - c + d.
\end{aligned} \tag{5.7}
$$



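The 2-D IDCT results can then be calculated by substituting
Eq. (5.4) into $E^T X_{ee} E$, $E^T X_{oe} O$, $O^T X_{eo} E$ and $O^T X_{oo} O$
in Eq. (3.49). Each element $z_{ij}$ of matrix Z is then equal to

$$
[z_{ij}] = \left\{
\begin{array}{l}
E^T X_{ee} E + E^T X_{oe} O + O^T X_{eo} E + O^T X_{oo} O \\
E^T X_{ee} E - E^T X_{oe} O + O^T X_{eo} E - O^T X_{oo} O \\
E^T X_{ee} E + E^T X_{oe} O - O^T X_{eo} E - O^T X_{oo} O \\
E^T X_{ee} E - E^T X_{oe} O - O^T X_{eo} E + O^T X_{oo} O
\end{array}
\right. \tag{5.5}
$$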



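Of course, one component for both the 2-D DCT and 2-D
IDCT algorithms according to one embodiment of the
present invention is still the 4x4 matrix multiplications unit:

$$
V_{4\times4} = B_{4\times4}\, U_{4\times4}\, C_{4\times4}
$$

and

$$
v_{ij} = \sum_{l=0}^{3}\left(\sum_{k=0}^{3} b_{ik}u_{kl}\right)c_{lj}
       = \sum_{l=0}^{3}\sum_{k=0}^{3} b_{ik}u_{kl}c_{lj}
       = \sum_{l=0}^{3}\sum_{k=0}^{3} u_{kl}\,(b_{ik}c_{lj}) \tag{5.6}
$$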



It can be seen from Eq. (5.2) to Eq. (5.6) above that
there are only two basic computation blocks for the algorithm
in this embodiment. The first one can be realized by a
shuffling device 1310 which carries out four pairs of
additions/subtractions with a butterfly interconnection
among the adders and subtracters. The second one is a 4x4

The shuffling operations, a total of 6 additions and 6
subtractions, can be carried out with 4 adders 1510, 1530,
1550, and 1560, 4 subtracters 1520, 1540, 1570, and 1580
and a butterfly interconnection, as illustrated in FIG. 15.
This arrangement requires that 4 input elements be fed into
the shuffler in each cycle, so a total of 16 cycles are consumed
for all 64 input data.

To accommodate both the 2-D DCT and 2-D IDCT
implementations, the input, output and internal wordlengths
of the shuffler should be 16 bits according to one example
implementation. If the implementation is for the 2-D DCT only,
"a", "b", "c" and "d" would be 9-bit and $x_{++}$, $x_{-+}$, $x_{+-}$
and $x_{--}$ should be 11-bit.
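
The two-layer butterfly described above can be modelled with a short Python sketch. The function name `shuffler` is an illustrative assumption, and the range check in the usage note further assumes that the 9-bit inputs are signed two's-complement values; neither is stated explicitly in the text.

```python
def shuffler(a, b, c, d):
    """Behavioral model of the 4-adder / 4-subtracter butterfly."""
    # First layer: 2 adders and 2 subtracters on the input pairs.
    p0, p1 = a + b, c + d
    m0, m1 = a - b, c - d
    # Second layer: 2 adders and 2 subtracters on the first-layer results.
    x_pp = p0 + p1  # a + b + c + d
    x_mp = m0 + m1  # a - b + c - d
    x_pm = p0 - p1  # a + b - c - d
    x_mm = m0 - m1  # a - b - c + d
    return x_pp, x_mp, x_pm, x_mm
```

Under the signed-input assumption, every combination of extreme 9-bit inputs (-256 or +255) produces outputs within the 11-bit signed range of -1024 to +1023, which is consistent with the wordlengths quoted for the DCT-only case.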

5.3 Sub-block Operator: 4x4 Matrix Multiplications Unit

For one example 2-D DCT algorithm according to the
present invention, each term of $E X_{++} E^T$, $E X_{-+} O^T$, $O X_{+-} E^T$
and $O X_{--} O^T$ (or $E^T X_{ee} E$, $E^T X_{oe} O$, $O^T X_{eo} E$ and $O^T X_{oo} O$ for
the 2-D IDCT) is a "half-size" 4x4 matrix multiplication. As
shown in Eq. (5.6), each matrix element can be decomposed
into 16 multiplications plus 15 additions. For the sake of
convenience, let us denote $E X_{++} E^T$, $E X_{-+} O^T$, $O X_{+-} E^T$ and
$O X_{--} O^T$ as the EE, EO, OE and OO sub-blocks and generalize
all the input matrices $X_{++}$, $X_{-+}$, $X_{+-}$, $X_{--}$, $X_{ee}$, $X_{oe}$, $X_{eo}$
and $X_{oo}$ as a 4x4 matrix U.
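
The operation count quoted above (16 multiplications plus 15 additions per output element) can be illustrated with a small Python sketch of Eq. (5.6). The function name `triple_product_element` and the returned counters are hypothetical and serve only this illustration.

```python
def triple_product_element(b, u, c, i, j):
    """Return (value, n_mult, n_add) for element (i, j) of B*U*C (all 4x4).

    Each input u[k][l] is multiplied once by the constant b[i][k] * c[l][j],
    which in hardware would be a pre-calculated coefficient.
    """
    terms = [u[k][l] * (b[i][k] * c[l][j]) for k in range(4) for l in range(4)]
    total = terms[0]
    n_add = 0
    for t in terms[1:]:
        total += t
        n_add += 1
    return total, len(terms), n_add
```

Because the products `b[i][k] * c[l][j]` depend only on the coefficient matrices, they can be fixed in advance, which is what allows the sub-block operator to use constant multipliers.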

Looking at the EE sub-block $V_{4\times4} = E\,U_{4\times4}E^T$, where
matrices B and C in Eq. (5.6) are replaced by matrix E
defined in Eq. (3.10), each element $v_{ij}$ of matrix V can
be computed as

$$
v_{ij} = \sum_{l=0}^{3}\sum_{k=0}^{3} e_{ik}u_{kl}e_{jl}
       = \sum_{l=0}^{3}\sum_{k=0}^{3} u_{kl}\,(e_{ik}e_{jl}) \tag{5.8}
$$

Since matrix E only consists of the elements $\{w_2, w_4, w_6\}$ (assuming the signs are ignored temporarily),
the whole set of $\{e_{ik}e_{jl}\}$ only contains six distinct elements:
$\{w_2w_2, w_2w_4, w_2w_6, w_4w_4, w_4w_6, w_6w_6\}$. Eq. (5.8)
can be extended as:

$$
\begin{aligned}
v_{00} ={}& u_{00}w_4w_4 + u_{01}w_4w_4 + u_{02}w_4w_4 + u_{03}w_4w_4 + {} \\
          & u_{10}w_4w_4 + u_{11}w_4w_4 + u_{12}w_4w_4 + u_{13}w_4w_4 + {} \\
          & u_{20}w_4w_4 + u_{21}w_4w_4 + u_{22}w_4w_4 + u_{23}w_4w_4 + {} \\
          & u_{30}w_4w_4 + u_{31}w_4w_4 + u_{32}w_4w_4 + u_{33}w_4w_4 \\
v_{01} ={}& u_{00}w_2w_4 + u_{01}w_4w_6 - u_{02}w_4w_6 - u_{03}w_2w_4 + {} \\
          & u_{10}w_2w_4 + u_{11}w_4w_6 - u_{12}w_4w_6 - u_{13}w_2w_4 + {} \\
          & u_{20}w_2w_4 + u_{21}w_4w_6 - u_{22}w_4w_6 - u_{23}w_2w_4 + {} \\
          & u_{30}w_2w_4 + u_{31}w_4w_6 - u_{32}w_4w_6 - u_{33}w_2w_4 \\
          & \vdots \\
v_{33} ={}& u_{00}w_6w_6 - u_{01}w_2w_6 + u_{02}w_2w_6 - u_{03}w_6w_6 - {} \\
          & u_{10}w_2w_6 + u_{11}w_2w_2 - u_{12}w_2w_2 + u_{13}w_2w_6 + {} \\
          & u_{20}w_2w_6 - u_{21}w_2w_2 + u_{22}w_2w_2 - u_{23}w_2w_6 - {} \\
          & u_{30}w_6w_6 + u_{31}w_2w_6 - u_{32}w_2w_6 + u_{33}w_6w_6
\end{aligned} \tag{5.9}
$$

If $u_{kl}$, $0 \le k, l \le 3$, is fed into a computational unit in 16
cycles, then all the elements of $V_{4\times4}$ can be calculated in
these 16 cycles by simply accumulating the multiplication
result of the current input $u_{kl}$ 1605 with one of the
$\pm\{w_2w_2, w_2w_4, w_2w_6, w_4w_4, w_4w_6, w_6w_6\}$, provided that circuit
delays are ignored. The whole EE sub-block 1320 can be
realized by 6 multipliers 1610, one 6-to-16 multiplexer 1620
and 16 accumulators 1630 (one for each $v_{ij}$, $0 \le i, j \le 3$).

In each clock cycle one input $u_{kl}$ arrives and the 6
multipliers 1610 are used to compute the multiplications
$u_{kl}\Omega_{22}$, $u_{kl}\Omega_{24}$, $u_{kl}\Omega_{26}$, $u_{kl}\Omega_{44}$, $u_{kl}\Omega_{46}$ and $u_{kl}\Omega_{66}$, where the
relation between $\Omega_{ij}$ and $w_iw_j$ is defined in Eq. (4.1). Then,
the 6-to-16 multiplexer 1620, which is controlled by a
mux-selector signal 1615, will multiplex the 6 multiplication
results 1625 onto the 16 accumulators 1630. Since only
$u_{kl}\Omega_{ij}$ instead of $\pm u_{kl}\Omega_{ij}$ is computed above, each accumulator
should be an adder/subtractor so that the $\pm u_{kl}\Omega_{ij}$ can
be handled by the accumulator as either adding or subtracting
$u_{kl}\Omega_{ij}$. Based on the add_sub signal, the accumulator
adds or subtracts the incoming data and gets reset every 16
cycles. The general structure of the EE sub-block is illustrated
in FIG. 16.

Since each $\Omega_{ij}$ is a pre-calculated constant, the multiplication
$u_{kl}\Omega_{ij}$ can easily be implemented by a few hardwired
adders (pseudo-multipliers). For example, let's look at
$u_{kl}\Omega_{22}$: $\Omega_{22}$ can be approximated with 13-bit
quantization precision as

$$
\Omega_{22} \approx 0.110110101_b \tag{5.10}
$$

The multiplication result of $u_{kl}\Omega_{22}$ can be calculated using
only adders as

$$
u_{kl}\Omega_{22} = (u_{kl} \gg 1) + (u_{kl} \gg 2) + (u_{kl} \gg 4) + (u_{kl} \gg 5) + (u_{kl} \gg 7) + (u_{kl} \gg 9) \tag{5.11}
$$

and a total of five adders are needed to accomplish the above
function, where the shift operations can be realized by
hardwired realignments.

The computation in (5.11) can be further simplified as

$$
x11 = u_{kl} + (u_{kl} \gg 1) \quad\text{and}\quad
u_{kl}\Omega_{22} = (x11 \gg 1) + (x11 \gg 4) + (u_{kl} \gg 7) + (u_{kl} \gg 9) \tag{5.12}
$$

where only four adders are required. But if both
adders and subtractors are allowed to be used by the pseudo-multiplier,
(5.11) can be even further simplified as

$$
x101 = u_{kl} + (u_{kl} \gg 2) \quad\text{and}\quad
u_{kl}\Omega_{22} = u_{kl} - (x101 \gg 3) + (x101 \gg 7) \tag{5.13}
$$

where only two adders and one subtractor are needed to
compute $u_{kl}\Omega_{22}$. Moreover, since only the top 14 bits (12-bit for
integer and 2-bit for decimal) of the multiplication result are
required to be passed on to each accumulator, some least
significant bits can be skipped when computing the expressions
above, which results in a further simplification. When the
implementation is for the 2-D DCT only, the input for the
multiplier should be 11 bits and only 12 bits (11-bit for
integer and 1-bit for decimal) of output would be required.

The 6-to-16 multiplexer 1620 shown in FIG. 16 can be
decomposed into 16 6-to-1 multiplexers (one for each
accumulator), which not only greatly simplifies the design
task, but also reduces the complexity of physical layout and
placement. Each 6-to-1 multiplexer is controlled by one set
of mux-selector codes which instruct the multiplexer to
multiplex one and only one multiplication result onto the
corresponding accumulator at each cycle. Taking $v_{33}$ in Eq.
(5.9) as an example, the output of the corresponding 6-to-1
multiplexer (or the input of the accumulator) would be
$u_{00}\Omega_{66}$, $u_{01}\Omega_{26}$, $u_{02}\Omega_{26}$, $u_{03}\Omega_{66}$, $u_{10}\Omega_{26}$, $u_{11}\Omega_{22}$, $u_{12}\Omega_{22}$,
$u_{13}\Omega_{26}$, $u_{20}\Omega_{26}$, $u_{21}\Omega_{22}$, $u_{22}\Omega_{22}$, $u_{23}\Omega_{26}$, $u_{30}\Omega_{66}$, $u_{31}\Omega_{26}$,
$u_{32}\Omega_{26}$ and $u_{33}\Omega_{66}$.

Each accumulator 1630 in FIG. 16 is an adder/subtractor
which is controlled by a set of add_sub signals and forced
to reset after $u_{33}\Omega_{ij}$ is processed. The set of add_sub
signals for matrix element $v_{33}$ in Eq. (5.9) is "+", "-", "+", "-",
"-", "+", "-", "+", "+", "-", "+", "-", "-", "+", "-"
and "+". The output data from each accumulator is 16 bits
(14-bit for integer and 2-bit for decimal) for a combined 2-D DCT/
IDCT implementation. The width can be reduced to 15 bits
(14-bit for integer and 1-bit for decimal) for a 2-D DCT only
implementation.

The EE sub-block 1320 described above can also be used
to compute $E^T X_{ee} E$ for the 2-D IDCT computation without
structural modification. To compute $E^T X_{ee} E$, matrices B
and C in Eq. (5.6) would be replaced by matrix $E^T$ instead
of E, which has the same set of elements as matrix E but
located in different positions and with different signs.
This simply implies that the set of mux-selector codes
being used to control the multiplexers and the sets of
add_sub signals for the accumulators will be different.

The EO, OE and OO sub-block operators 1330-1350 can
be implemented with a sub-block architecture similar to that of EE.
From Eq. (3.10) it is known that while the matrix E only
consists of even coefficient elements $w_2$, $w_4$ and $w_6$, the
matrix O only has odd elements $w_1$, $w_3$, $w_5$ and $w_7$. So for
the EO and OE sub-blocks, one of the coefficients in $w_iw_j$
is from the even coefficient set $\{w_2, w_4, w_6\}$ and the other is from
the odd coefficient set $\{w_1, w_3, w_5, w_7\}$, and every element in
the EO and OE sub-blocks is the sum of products of $u_{kl}$
multiplied with $w_1w_2$, $w_1w_4$, $w_1w_6$, $w_3w_2$, $w_3w_4$, $w_3w_6$,
$w_5w_2$, $w_5w_4$, $w_5w_6$, $w_7w_2$, $w_7w_4$ and $w_7w_6$, which means
that each of these sub-block operators consists of a total of 12 multipliers,
one 12-to-16 multiplexer and 16 accumulators. For the OO
sub-block, both of the coefficients in $w_iw_j$ are from the odd
quantization precision as 
coefficient set $\{w_1, w_3, w_5, w_7\}$, and every element in the OO
sub-block is the sum of products of $u_{kl}$ multiplied with
$w_1w_1$, $w_1w_3$, $w_1w_5$, $w_1w_7$, $w_3w_3$, $w_3w_5$, $w_3w_7$, $w_5w_5$, $w_5w_7$
and $w_7w_7$, for a total of 10 multipliers, one 10-to-16 multiplexer
and 16 accumulators. Running in parallel, 4 input
elements are fed into the 4 sub-block operators in each cycle and
a total of 16 cycles are needed to process the 64 input 2-D
IDCT data.

Together, a total of 40 pseudo-multipliers, 1
6-to-16 multiplexer (or 16 6-to-1 multiplexers), 2 12-to-16
multiplexers (or 32 12-to-1 multiplexers), 1 10-to-16 multiplexer
(or 16 10-to-1 multiplexers) and 64 accumulators are
required for all EE, EO, OE and OO sub-blocks.

5.4 Auxiliary Components for 2-D DCT or IDCT
Implementations

In order to compute $z_{ij}$ in Eq. (5.7) for the 2-D IDCT, 4
16-to-1 multiplexers (or output-muxes) 1720 are needed and
each can multiplex one out of sixteen accumulator results in
each of the EE, EO, OE and OO sub-blocks, respectively. The
selected data (four in parallel) are then sent to the shuffler to
carry out four pairs of addition/subtraction with a butterfly
interconnection before the final results $z_{ij}$ are generated.
These 4 16-to-1 multiplexers 1720 can also be used for the
2-D DCT hardware implementation so that only 4 output words
are sent out at each cycle and a balanced I/O scheme
is achieved.

If a continuous input data stream is desired (i.e. no gap
between consecutive input blocks), an additional 64 latches
1710 are needed for temporarily holding the sub-block results
for 16 cycles. The latches 1710 would be positioned between
the accumulators and the output-muxes 1720. As soon as 16
consecutive accumulations are completed for the current
input data, the 64 latches would latch out the current data
being generated in the 64 accumulators and free them up to
carry out the accumulation task for the next input data set.

In this way, the data held in the latches 1710 would be
multiplexed out in 16 cycles by the above 4 16-to-1 multiplexers
1720. For a combined 2-D DCT/IDCT approach,
each latch is 16-bit long (for the 2-D DCT only case, the
latch is 15-bit long).

Finally, each 16-bit output data is required to be truncated
and saturated within the required data ranges of the 2-D
DCT and 2-D IDCT. For the 2-D DCT computation, each
word being multiplexed out from the latches is truncated to
the nearest integer at first, followed by a saturation operation
which clips the integer in the range of -2048 to +2047
(12-bit long for each). For the 2-D IDCT, the 16-bit words,
which are coming out from the shuffler, are truncated to the
nearest integers before the integers are saturated in the range
of -256 to +255 (9-bit long for each). Four DCT truncation
& saturation control units (or clippers) 1730 are needed for the
2-D DCT implementation while four IDCT truncation &
saturation control units 1730 are needed for the 2-D IDCT
computation. One example implementation 1700 of the
auxiliary modules is illustrated in FIG. 17.

5.5 Architectures for DCT, IDCT and Combined
DCT/IDCT

Having described the shuffler 1310, the EE, EO, OE and
OO sub-blocks 1320-1350, the sixty-four latches 1710, the
four output-muxes 1720 and the four truncation & saturation
control units 1730 in previous sections, a 2-D DCT algorithm
can now be constructed with a concurrent pipelined
architecture 1800 according to a further embodiment of the
present invention (FIG. 18). At each cycle, four input
elements 1802, 1804, 1806, 1808 are fed into the DCT
shuffler 1310, in which four pairs of addition/subtraction
butterfly operations take place before the $x_{++}$, $x_{-+}$, $x_{+-}$ and
$x_{--}$ values are sent out. Then, the output data from the
shuffler 1310 are fed into the EE, EO, OE and OO sub-blocks in
parallel. Within each sub-block, the multiplications of $u_{kl}\Omega_{ij}$
are carried out by pseudo-multipliers and the multiplication
results are multiplexed onto 16 accumulators.
The accumulated data in each sub-block are latched by the
latches 1710 as soon as all 16 input data have been processed
by the sub-blocks. Therefore, after the accumulators are
reset, each sub-block is ready to process
the next 8x8 input data set. The latched accumulation results

$$
v_{ij} = \sum_{l=0}^{3}\sum_{k=0}^{3} u_{kl}\,(b_{ik}c_{lj})
$$

can be multiplexed out in parallel by four 16-to-1 multiplexers
(output-muxes 1720) in 16 clock cycles. Finally,
after being further processed by four DCT truncation &
saturation control units 1730, the DCT results 1852, 1854,
1856, 1858 come out in parallel. The results for one set of
8x8 input data come out 4 elements in one row for 16
consecutive clock cycles, and each is a 12-bit long integer.

By using adders, subtractors, pseudo-multipliers, multiplexers
and accumulators, the 2-D DCT can be implemented
by a pure datapath (i.e. consisting of processing elements
only), which is illustrated in FIG. 18. It is worth noting that
no extra memory components are required in this concurrent
pipelined architecture.

In addition, the same circuit modules can be used to
implement the 2-D IDCT algorithm, given that the shuffler
1310 is moved below the output-muxes 1720 as shown in
FIG. 19.

Since the shuffler 1310, the EE, EO, OE and OO sub-blocks
1320-1350, the latches 1710 and the output-muxes
1720 can all be shared between the 2-D DCT and 2-D IDCT
implementations for the algorithm, a combined 2-D DCT/
IDCT implementation can be easily constructed according to
a further embodiment of the present invention. Four more
input-muxes 2010 and four more inter-muxes 2030 are
required to switch back and forth between the DCT data and
the IDCT data for the combined 2-D DCT/IDCT implementation.
Moreover, the DCT truncation and saturation control
units (or clippers) can each be a clipper 1730 as described
above. A combined architecture according to an
embodiment of the present invention is shown in FIG. 20.

As shown in FIGS. 18 to 20, the example 2-D DCT/
IDCT algorithm does have a highly modular, regular
and concurrent architecture. All processing units in FIG. 20
are 100% sharable between the 2-D DCT and the 2-D IDCT
operations. The processing speed of each module is perfectly
balanced at a four samples per cycle I/O rate. By processing
8x8 input data in 16 clocks on average, this pipelined
implementation can achieve 4 samples/cycle throughput for
both the 2-D DCT and 2-D IDCT operations.

5.6 Summary

In this section, the hardware implementation schemes for
a 2-D DCT, a 2-D IDCT and a combined 2-D DCT/IDCT,
which are all based on the algorithm of the present
invention, have been introduced. The key components of the
implementation schemes include a shuffler 1310, sub-block
operators 1320-1350 (which are made of 40 pseudo-multipliers,
x-to-16 multiplexers and accumulators), accumulator
latches 1710, output-muxes 1720 and output clippers
1730.
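
The way a sub-block operator combines pseudo-multipliers, a multiplexer and add/subtract accumulators can be sketched as a behavioral Python model of the EE sub-block. Several things here are assumptions made for illustration only: the coefficient values $w_k = \cos(k\pi/16)$, the sign pattern of matrix E (chosen to be consistent with the $v_{00}$, $v_{01}$ and $v_{33}$ expansions, since Eq. (3.10) is not reproduced in this section), floating-point products in place of 13-bit pseudo-multipliers, and the names `controls` and `ee_sub_block`.

```python
import math

# Assumed coefficient values; the text only names them w2, w4 and w6.
W = {k: math.cos(k * math.pi / 16) for k in (2, 4, 6)}

# Assumed even-coefficient matrix E, stored as (sign, coefficient index).
E = [
    [(+1, 4), (+1, 4), (+1, 4), (+1, 4)],
    [(+1, 2), (+1, 6), (-1, 6), (-1, 2)],
    [(+1, 4), (-1, 4), (-1, 4), (+1, 4)],
    [(+1, 6), (-1, 2), (+1, 2), (-1, 6)],
]

# The six distinct coefficient products handled by the six multipliers.
PAIRS = [(2, 2), (2, 4), (2, 6), (4, 4), (4, 6), (6, 6)]


def controls(i, j):
    """Per-cycle (mux-selector code, add_sub sign) for accumulator v_ij."""
    codes = []
    for k in range(4):
        for l in range(4):
            (s1, m), (s2, n) = E[i][k], E[j][l]
            codes.append(((min(m, n), max(m, n)), s1 * s2))
    return codes


def ee_sub_block(u):
    """Accumulate V = E * U * E^T over 16 cycles, one input u[k][l] per cycle."""
    acc = [[0.0] * 4 for _ in range(4)]
    ctrl = [[controls(i, j) for j in range(4)] for i in range(4)]
    for cycle in range(16):
        k, l = divmod(cycle, 4)
        # Six multiplications of the current input by the constant products.
        products = {p: u[k][l] * W[p[0]] * W[p[1]] for p in PAIRS}
        for i in range(4):
            for j in range(4):
                sel, sign = ctrl[i][j][cycle]      # mux select and add/sub
                acc[i][j] += sign * products[sel]  # accumulator update
    return acc
```

In this model only six products are formed per cycle, and all sixteen accumulators pick one of them with a sign, which mirrors the 6 multipliers, 16 6-to-1 multiplexers and 16 adder/subtractor accumulators of the EE sub-block.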
scheme to accomplish the computations in Eq. (5.7). Since
four adders and four subtractors are arranged in a two-layer
structure, the latency between the input and output might be
two clock cycles. But it can be reduced to one cycle if fast
adders and subtractors are used in order to decrease the
system delay.

The sub-block operators are the most important
components in this hardware implementation scheme. In
each sub-block, several hardwired adders and subtractors
can be used to compute the multiplication $u_{kl}w_iw_j$, and
the accumulation operations for each element $v_{ij}$
can be carried out by a multiplexer and 16 accumulators. The
design task can be greatly simplified because the EO and the
OE sub-blocks have identical structures, and the 64
accumulators in the four sub-blocks are also identical. It has
been shown that there are no communication interconnections
between the sub-blocks. Consequently, localized interconnection
has been achieved with this architecture.

In addition, by latching the results of the 4x4 matrix
multiplications, each sub-block can be freed up to start
processing the next set of input data as soon as the last input data
in the current data set are processed, which doubles the
system throughput rate. Without the accumulator
latches, each sub-block would have to pause for 16 cycles
before it could start to process the next set of input data, since it
takes 16 cycles to retrieve all the 64 data in the accumulators
through the output-muxes.

The circuit synthesis will be relatively simple since an
example implementation only uses adders, subtractors and
multiplexers and all of them are 100% sharable between the 2-D
DCT and 2-D IDCT operations. The only overhead for the
combined 2-D DCT/IDCT implementation is four input-muxes
and four inter-muxes, which are used to switch back
and forth between the DCT data and the IDCT data. The
modularity, regularity, concurrency and localized interconnection
of the architecture according to the present invention
make it well suited for VLSI implementation.

6.0 HDL Design and Synthesis for an Example 2-D
DCT/IDCT Algorithm

In this section, some modern VLSI design techniques
are used to implement an example 2-D DCT/IDCT
algorithm according to the present invention. The circuit
logic synthesis is carried out with Synopsys Design Compiler®
and the pre-layout gate-level simulation result shows
that the implementation can achieve 800 million samples per
second throughput with only 7-cycle circuit delay.

6.1 Introduction

With the development of highly automated Computer
Aided Design (CAD) systems for digital VLSI circuit
design, Application Specific Integrated Circuits (ASICs) have
become a very attractive way to achieve performance
approaching that of custom-crafted VLSI at very
modest cost. It seems clear that the high-level aspects of
VLSI CAD technology can be applied to the implementation
of specific algorithms in VLSI circuits so that
the design and development of application-specific
processors will be greatly simplified. In fact, this approach
can drastically reduce the risk and the cost of creating
application-specific circuits for algorithms intended to be
implemented with VLSI technology.

Recently, advances in integrated circuit technology have 
provided the ability to design and synthesize high-quality 
Application Specific Integrated Circuits that are commer- 
cially viable [Swa97, AB094]. The limitations of old sche- 
matic capture based design have become clearly evident 
with the increase of complexity of integrated circuits. High- 
level design (steps 2100-2120) and logic synthesis (step 
2130) have provided an answer to these limitations. High- 
level design methodology using hardware description lan- 
guages (HDLs), such as Verilog and VHDL, has emerged as 
the primary means to capture input functionality (step 2100) 
and deal with a range of design issues. These HDLs not only 
offer high-level descriptions of the circuits that match the 
required design specifications (step 2110), but also provide 
means to verify and synthesize the circuits with the help of 
Electronic Design Automation (EDA) tools. 

A methodical approach to system implementations can be 
formulated as a synthesis-oriented solution which has been 
enormously successful in the design of individual integrated 
circuit chips. Instead of using a specification as a set of 
loosely defined functionalities, a synthesis approach for 
hardware begins with systems described at the behavioral 
level and/or register transfer level (RTL) by means of 
appropriate procedural hardware description languages. In 
general, logic synthesis is the process of transforming func- 
tionality which is initially described in HDL to an optimized 
technology specific netlist. After the simulations of the 
high-level description are carried out (step 2120) to verify 
that the designed functionalities can be achieved, the high- 
level description would be further translated down to some 
detailed hardware which only consist of basic logic gates 
(step 2130). Thus the outcome of the logic synthesis (step 
2130) is a gate-level description that can be implemented as 
a single chip or as multiple chips (step 2140). The use of 
logic synthesis (step 2130) has made it possible to effec- 
tively translate designs captured in these high-level lan- 
guages to designs optimized for area and/or speed. In 
addition to enabling control over design parameters such as 
silicon real-estate and timing, logic synthesis tools facilitate 
the capture of designs in a parameterizable and reusable
form. Moreover, logic synthesis (step 2130) makes it pos- 
sible to retarget a given design to new and emerging 
semiconductor technologies. The synthesis-oriented 
approach, which is highlighted in FIG. 21, has been gaining 
wide acceptance in recent years. 

Using the synthesis-oriented approach, the circuit design 
and synthesis of an example algorithm according to the 
present invention are carried out in the following sections. In 
section 6.2, the circuit design specifications for major 
components, plus the necessary communication handshaking
signals, are discussed in detail. The RTL
programming and simulation of the example algorithm
are also carried out in that section. In section 6.3, logic
synthesis of an example circuit implementation is performed 
and the synthesis result is verified by prelayout gate-level 
simulation. A brief summary of this section is presented in 
section 6.4. 

6.2 HDL Design and Simulation 

Generally speaking, high-level designs and simulations 
are often referred to as behavioral, because only the exter- 
nally visible behavior of component devices is specified, 
rather than the precise detail structure of the components. 
Behavioral description is usually used to initially verify the 
functionalities and feasibility of the original designs. In
contrast, register transfer level (RTL) designs and simula- 
tions are written in terms of structural elements such as 
registers, and the signals used can take on values of bytes, 
words, and so forth. Gate-level design and simulation are 
usually related to the circuits down to the detail of logic 
gates, and the unit of data on the signals is the bit (single 
register). 

Since detailed functionality and finite wordlength simu- 
lations have been carried out in Section 4, the behavioral 
design and simulation of the proposed algorithm can be 
skipped here. Instead, all the design and simulation efforts in 
this section will focus on the RTL description. Although 
HDL code can be behavioral or RTL, the latter is usually 
considered to be the synthesizable form of HDL code in the 
synthesis domain. 

Based on the architectural designs in Section 5, the key 
components of the example algorithm implementation will 
be captured with HDL codes, and then be simulated. In some 
cases, the HDL codes might have to be modified slightly in 
order to achieve more efficient synthesized gate combina- 
tions. 

6.2 HDL Design for Shuffler 

It can be seen from section 5.2 that the shuffler can be 
implemented with 4 adders and 4 subtracters positioned in 
two layers. The top layer consists of 2 adders and 2 
subtracters, and the bottom layer also consists of 2 adders 
and 2 subtractors. The functionality of the shuffler can be 
captured with Verilog hardware description language as 



// shuffler for proposed 2-D DCT/IDCT implementation
// pipelined with 2 cycles delay
module shuffler(in0,in1,in2,in3,clk,xpp,xmp,xpm,xmm);
parameter width = 16;                  // internal wordlength
input  [width-1:0] in0,in1,in2,in3;    // 4 input data
input              clk;                // system clock
output [width-1:0] xpp,xmp,xpm,xmm;    // 4 output data
reg    [width-1:0] xp0,xp1,xm0,xm1;    // 4 intermediate data
reg    [width-1:0] xpp,xmp,xpm,xmm;    // registered output
wire   [width-1:0] add1 = in0 + in1;   // 4 adders & 4 subtractors inferred
wire   [width-1:0] add2 = in2 + in3;
wire   [width-1:0] sub1 = in0 - in1;
wire   [width-1:0] sub2 = in2 - in3;
wire   [width-1:0] add3 = xp0 + xp1;
wire   [width-1:0] add4 = xm0 + xm1;
wire   [width-1:0] sub3 = xp0 - xp1;
wire   [width-1:0] sub4 = xm0 - xm1;
always @ (posedge clk) begin           // at positive edge of each clock
xp0 <= #0.1 add1;                      // top layer, intermediate results
xp1 <= #0.1 add2;                      // registered with 1 cycle delay
xm0 <= #0.1 sub1;
xm1 <= #0.1 sub2;
xpp <= #0.1 add3;                      // bottom layer, final results
xmp <= #0.1 add4;                      // reg. output with 2-cycle delay
xpm <= #0.1 sub3;
xmm <= #0.1 sub4;
end
endmodule                              // end of module shuffler



The functional description above uses 4 16-bit registers as
intermediate storage to separate the top and bottom layers'
arithmetic operations into separate clocks. In this way, only
one addition or subtraction operation is executed per clock
period. Not only does this approach simplify the synthesis
complexity of the shuffler module, but it also adds pipelining
ability to the module. This enables the processing rate to be
linearly scaled up, since the processing elements "add1",
"add2", "sub1" and "sub2" can start to process the next 4
data elements while the current 4 data elements are still
being processed by elements "add3", "add4", "sub3" and
"sub4". All the registers used in the shuffler module would
eventually be replaced by positive-edge-triggered flip-flops.

6.2.1 HDL Design for Sub-block Operators

It has been shown in section 5.3 that each sub-block
operator consists of 6-12 pseudo-multipliers (each of which
can be implemented with a few hardwired
adders/subtractors), 16 x-to-1 multiplexers and 16 16-bit
accumulators. Using 13-bit approximations for the coefficients
as shown in Table 4.10, it can clearly be seen that
it is always possible to implement any pseudo-multiplier
with at most 7 adders and subtractors. If one arranges these
7 adders and subtractors into a 3-layer configuration similar
to the shuffler, then each pseudo-multiplier can also be
implemented with a pipelined structure that has no more
than 3-cycle circuit delay. The output results from the
pseudo-multipliers will be multiplexed onto the 64 accumulators,
and one more cycle of delay is allowed for the
multiplexing and accumulating operations.

Since there are too many independent modules in the EE,
EO, OE and OO sub-blocks, it would be impractical to
present the HDL descriptions here for all of them. Therefore,
only one pseudo-multiplier, one 6-to-1 multiplexer and one
16-bit accumulator from the EE sub-block are presented as
examples to demonstrate their RTL descriptions.

It has been shown in section 5.3 that the multiplication
$u_{kl}\Omega_{22}$ can be carried out as

$$
\begin{aligned}
x101 &= u_{kl} + (u_{kl} \gg 2), \\
u_{kl}\Omega_{22} &= u_{kl} - (x101 \gg 3) + (x101 \gg 7)
\end{aligned} \tag{6.1}
$$




where 2 adders and 1 subtractor are needed to yield the 
multiplication result. The function thus can be captured with 
Verilog hardware description language as 



// pseudo-multiplier for u * 0.110110101b
// pipelined with 3 cycles delay
module m_ww22(in,clk,out);
parameter width = 12;                   // input wordlength
input  [width-1:0] in;                  // input data
input              clk;                 // system clock
output [width+1:0] out;                 // output data
reg    [width-1:0] in_d1;               // 1-cycle delayed input value
reg    [width+2:0] x1;                  // intermediate data
reg    [width-1:0] in_d2;               // 2-cycle delayed input
reg    [width+2:0] x2;                  // intermediate data
reg    [width+1:0] out;                 // registered output
wire   [width+2:0] add1 = {in,2'b00} + in;        // x+(x>>2)
wire   [width+2:0] sub1 = x1[width+2:4] - x1;     // (x1>>4)-x1
wire   [width+2:0] add2 = {in_d2,3'b001}
                          + x2[width+2:2];        // final result
                                                  // with 0.001b preset
always @ (posedge clk) begin            // at positive edge of each clock
    x1    <= #0.1 add1;                 // registered x101
    in_d1 <= #0.1 in;                   // 1-cycle delayed input value
    x2    <= #0.1 sub1;                 // registered (x101>>4)-x101
    in_d2 <= #0.1 in_d1;                // 2-cycle delayed input value
    out   <= #0.1 add2[width+2:1];      // registered output
                                        // truncated to 12.2
end
endmodule                               // end of module m_ww22



The registers "x1", "x2", "in_d1" and "in_d2" are
employed in module "m_ww22" to hold the intermediate
multiplication results, which makes the pseudo-multiplier
able to operate in a pipelined mode.
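
As a cross-check of Eq. (6.1), the shift-and-add datapath of "m_ww22" can be modelled in software. The following is a minimal Python sketch and is not part of the original HDL. It assumes signed (two's-complement) operands with arithmetic right shifts, and the names `mul_omega22` and `PseudoMultiplierPipeline` are hypothetical. The class mirrors the three register stages so that the 3-cycle latency is visible.

```python
# Behavioural sketch of the m_ww22 shift-and-add pseudo-multiplier.
# Assumption: operands are signed integers and ">>" is an arithmetic shift.

COEFF = 0b110110101 / 2**9  # 0.110110101b = 0.853515625


def mul_omega22(u):
    """Return u * 0.110110101b in 12.2 fixed point (integer = value * 4)."""
    x1 = (u << 2) + u                 # add1: x101 = 1.01b * u, 2 fraction bits
    x2 = (x1 >> 4) - x1               # sub1: (x101 >> 4) - x101
    add2 = (u << 3) + 1 + (x2 >> 2)   # add2: u with 0.001b preset, plus x2 >> 3
    return add2 >> 1                  # drop one bit: truncate to 12.2


class PseudoMultiplierPipeline:
    """Registers updated together on each clock, like the posedge block."""

    def __init__(self):
        self.x1 = self.in_d1 = self.x2 = self.in_d2 = self.out = 0

    def clock(self, u):
        # Compute every next-state value from the current registers first,
        # then commit them at once (non-blocking assignment semantics).
        add1 = (u << 2) + u
        sub1 = (self.x1 >> 4) - self.x1
        add2 = (self.in_d2 << 3) + 1 + (self.x2 >> 2)
        self.x1, self.in_d1, self.x2, self.in_d2, self.out = (
            add1, u, sub1, self.in_d1, add2 >> 1)
        return self.out
```

In this model a value presented on one call to `clock` appears at the output on the third call, and dividing the returned integer by 4 gives the product in ordinary units.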

The 6-to-1 multiplexer, which is the second key component
module in the EE sub-block, can easily be coded with
Verilog HDL since it consists of pure combinational logic.
The six input data for the multiplexer come from six
pseudo-multipliers in the EE sub-block, and one and only one
of them will be selected as module output based on the
current mux-selector value.

// 6-to-1 mux
module mux_6_1(in0,in1,in2,in3,in4,in5,mux_sel,out);
parameter width = 14;                          // input wordlength
input  [width-1:0] in0,in1,in2,in3,in4,in5;    // input data
input  [2:0]       mux_sel;                    // mux selector
output [width-1:0] out;                        // output data
reg    [width-1:0] out;
always @ (mux_sel or in0 or in1 or in2 or in3 or in4 or in5) begin
                                       // triggered by any signal changes
    case (mux_sel)                     // synopsys full_case parallel_case
        3'b000: out <= in0;            // select in0 as output if mux_sel=0
        3'b001: out <= in1;
        3'b010: out <= in2;
        3'b011: out <= in3;
        3'b100: out <= in4;
        3'b101: out <= in5;            // select in5 as output if mux_sel=5
    endcase
end
endmodule                              // end of module mux_6_1
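
To illustrate how a selected product is consumed, the multiplexer can be modelled together with the add/subtract accumulator that follows in this section. This is a hedged Python sketch only: the helper `run_schedule` and the control schedule used with it are hypothetical, since the real mux-selector, reset and add-sub values come from the equations cited in the text and are not reproduced here. For simplicity the six products are held constant, whereas in the pipelined hardware they change every cycle.

```python
# Behavioural sketch: 6-to-1 mux feeding an add/subtract accumulator.
# Assumption: plain Python integers stand in for the fixed-width registers.

def mux_6_1(inputs, mux_sel):
    """Return the input chosen by mux_sel (0..5), like the case statement."""
    if not 0 <= mux_sel <= 5:
        raise ValueError("mux_sel must be in 0..5")
    return inputs[mux_sel]


class Accumulator:
    """Registered accumulator: reset restarts the sum, sel=1 subtracts."""

    def __init__(self):
        self.out = 0

    def clock(self, value, sel, reset):
        org = 0 if reset else self.out          # previous result, or 0 on reset
        self.out = org - value if sel else org + value
        return self.out


def run_schedule(products, schedule):
    """Apply (mux_sel, sel, reset) control triples, one per clock cycle."""
    acc = Accumulator()
    for mux_sel, sel, reset in schedule:
        acc.clock(mux_6_1(products, mux_sel), sel, reset)
    return acc.out
```

For example, with products `[10, 20, 30, 40, 50, 60]` and the illustrative schedule `[(0, 0, 1), (3, 1, 0), (5, 0, 0)]`, the model accumulates 10 - 40 + 60.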



The 16-bit accumulator, which is the third key component
module in the EE sub-block, adds the current input data to,
or subtracts it from, the previously accumulated result stored
in it. The accumulator has to be reset every 16 cycles in order
to get a fresh start for the next set of input data. The HDL code
for this functionality is

// 16-bit accumulator
// 1-cycle delay
module accumulator(in,sel,reset,clk,out);
parameter width = 14;                   // input wordlength
input  [width-1:0] in;                  // input data
input              sel;                 // addition/subtraction selector
input              reset;               // accumulator reset
input              clk;                 // system clock
output [width+2:0] out;                 // output data
reg    [width+2:0] out;
wire   [width+2:0] org = reset ? 16'b0 : out;     // previous result
wire   [width+2:0] acc = sel ? (org - in)
                             : (org + in);        // if sel=0, addition
                                                  // else, subtraction
always @ (posedge clk) begin            // at positive edge of each clock
    out <= #0.1 acc;                    // registered output
end
endmodule                               // end of module accumulator

A finite state machine with 16 distinct states is
employed to hold the current state and generate the
mux-selector, reset and add-sub selector signals. The values of
mux-selector, reset and add-sub selector for the 16 different
states can be pre-calculated with Eq. (5.8) and Eq. (3.8).

The detailed HDL coding for the 4 sub-blocks reveals that a
total of 82 adders and subtractors are needed to implement
the 40 pseudo-multipliers in the EE, EO, OE and OO
sub-blocks, which is equivalent to an average of 2.05
adders/subtractors for each multiplication. In order to carry
out the summation function in Eq. (5.8), 64 16-bit
adder/subtractors are also needed. Each processing element (PE)
in these 4 sub-blocks, which might be an adder, a subtractor or
an accumulator, is balanced since the computation time of each
PE equals the I/O time required by the data transmission.
Except in multiplexer modules, all the registers used in the
sub-block implementations will eventually be replaced by
positive edge triggered flip flops.

6.2.3 HDL Design for Auxiliary Components

It has been demonstrated in section 5.4 that several
auxiliary components are required by one example combined
2-D DCT/IDCT implementation. Among them, 64
16-bit latches are used to latch and hold the sub-blocks'
output data for 16 cycles when the accumulator-ready-to-send
signal is asserted. The 4 16-to-1 output-multiplexers are
used to select 4 latched data out in each cycle, and each of
them can be coded similarly to the 6-to-1 multiplexer
discussed in section 6.2.2. Several truncation & saturation
control units are required by both 2-D DCT and IDCT
operations and each can be built with an incrementor and a
simple clipping circuit. Moreover, the input-multiplexer and
intermediate-multiplexer shown in FIG. 20 can be made up
of a simple 2-to-1 multiplexer controlled by the DCT/IDCT
selector input. The detailed HDL descriptions for the
auxiliary components are not included in this section because
they are relatively simple, as should be apparent to a person
skilled in the art given this description.

6.2.4 HDL Simulation for Combined 2-D DCT/IDCT

The HDL simulation of a combined 2-D DCT/IDCT
module, coded with Verilog hardware description language,
is carried out with Cadence's Verilog-XL® simulation tool
[Cad]. A "top" module (i.e. one that sits on top of the combined
2-D DCT/IDCT module) is adopted as the testbench in the
simulation process and behaves like an interface component
to control the interaction between the tested module and its
environment. The signal "sel_dct" is controlled by the "top"
module to indicate whether DCT or IDCT computation is
desired. After the system reset, the "top" module will start to
feed input data into the tested module as soon as the input
ready-to-receive signal "inp_rtr" is asserted by the tested
module. It takes 16 cycles to complete the data feeding of
one set of input data. The output ready-to-send signal
"out_rts" will be asserted by the tested module when the
2-D DCT/IDCT computation for one set of input data is
finished and the valid output results are available. Then, the
"top" module starts to capture the output data from the tested
module and compares them with pre-calculated 2-D DCT/IDCT
results. 4 input data elements are fed to, and 4 output
elements are captured from, the tested module in
each clock cycle. If any inconsistencies are detected, the
comparison error signal "cmp_err" will go from low to
high.

The "top" module also generates a 200 MHz system clock.
Consecutive sets of input data can be loaded into the tested
module continuously. In the simulation, consecutive data sets
are separated by a random number of clock cycles in the range
from 0 to 3. With a 200 MHz clock, the maximum throughput of
this combined 2-D DCT/IDCT module is 800 million samples per
second for both DCT and IDCT operations.
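
The throughput figure follows directly from the clock rate and the 4 samples transferred per cycle. The short Python sketch below simply reproduces that arithmetic from the numbers stated in the text; the function name `throughput` is hypothetical, and back-to-back data sets (0 idle cycles between sets) are assumed.

```python
# Worked arithmetic for the peak throughput of the combined module.
# Assumption: data sets are fed back to back with no idle cycles.

CLOCK_HZ = 200_000_000      # system clock generated by the "top" module
SAMPLES_PER_CYCLE = 4       # 4 data elements enter and leave per clock
SAMPLES_PER_BLOCK = 8 * 8   # one 8x8 block


def throughput(clock_hz=CLOCK_HZ,
               samples_per_cycle=SAMPLES_PER_CYCLE,
               samples_per_block=SAMPLES_PER_BLOCK):
    """Return (samples per second, cycles per block, blocks per second)."""
    samples_per_second = clock_hz * samples_per_cycle
    cycles_per_block = samples_per_block // samples_per_cycle
    blocks_per_second = clock_hz // cycles_per_block
    return samples_per_second, cycles_per_block, blocks_per_second
```

With the default values this gives 800 million samples per second, 16 cycles per 8x8 block and 12.5 million blocks per second.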

For 2-D DCT HDL simulation, 1000 sets of randomly
generated 8x8 input data are used to verify the HDL coded
2-D DCT functionality. The signal waveforms are illustrated
in FIG. 22. Some handshaking signals are used to coordinate
the operations among the submodules inside the 2-D DCT/IDCT
module. The signal "sel_dct" is asserted here to
indicate that the desired operation is a 2-D DCT.

In FIG. 22, several critical timing moments are appropri- 
ately labeled to indicate the following events: 

(a) System reset signal goes from high to low, which 
indicates that system reset process has just completed; 

(b) At the positive edge of the clock, the first 4 data from
the first input data set are fetched into the tested
module. A finite state machine will keep incrementing
the current state "inp_st[3:0]" and set the input
ready-to-receive signal, "inp_rtr", to high again when all 64
input data from the input data set have been fetched.
The data loaded in will go directly into the shuffler
module. The next input data set will not be fetched into
the tested system until both the tested module is ready
to receive (signal "inp_rtr" is asserted) and the "top"
module has input data ready to send (which is
demonstrated by setting input ready-to-send signal "inp_rts"
to high);

(c) At the positive edge of the clock, the shuffler starts to
transmit the shuffling results into the 4 sub-block modules
in 16 consecutive cycles. The shuffler ready-to-send
signal, "shf_rts", was asserted before this rising edge
of the clock, which indicates the shuffling results are
available. The multiplier start-to-receive signal,
"mul_str", informs the pseudo-multiplier modules that input
data u_ij are coming;

(d) At the positive edge of the clock, the pseudo-multipliers
start to send out the multiplication results
into the accumulators for 16 consecutive cycles. The
multiplier ready-to-send signal, "mul_rts", indicates
that the output data from the multipliers are valid at the
rising edges of the clock when the signal "mul_rts" is
asserted. Another finite state machine is employed here
to track the current accumulator state "acc_st[3:0]",
which increments by one after the positive edge of the
clock to show that an accumulation operation has
completed;

(e) At the positive edge of the clock, the last 4 data from
the first input data set are fetched into the tested
module. The input ready-to-receive signal "inp_rtr" is
asserted again to indicate that the tested module is
ready for the next input data set;

(f) At the positive edge of the clock, the first 4 data from 
the second input data set are fetched into the tested 
module. There is 0 cycle latency between the first and 
second input data sets; 

(g) At the positive edge of the clock, the current
accumulator results are being latched into accumulator
latches. The accumulator ready-to-send signal,
"acc_rts", was asserted before the positive edge, which
indicates that the accumulators latching should occur
exactly at the positive clock edge;

(h) At the positive edge of the clock, the 2-D DCT results
based on the first input data set are available and will
be shifted out in a total of 16 consecutive cycles as long as
the output ready-to-send signal, "out_rts", remains
high. A third finite state machine is employed here
and the current state machine value "out_st[3:0]" is used
as the out-mux selector to control the operation of the
16-to-1 output multiplexers. The 4 DCT output data in
each cycle are compared to the pre-calculated DCT
results and any inconsistency will trigger the comparison
error signal "cmp_err". The comparison counter
"cmp_cnt" is used to track how many comparisons
have been made;

(i) At the positive edge of the clock, the last 4 data from 
the second input data set are fetched into the tested 
module. The input ready-to-receive signal "inp_rtr" is 
asserted again to indicate that the tested module is 
ready for next input data set; 

(j) At the positive edge of the clock, the first 4 data from
the third input data set are fetched into the tested module.
There is a 1 cycle pause between the second and third
input data sets;

(k) At the positive edge of the clock, the 2-D DCT results
based on the second input data set are coming out in a
total of another 16 consecutive cycles.

Not a single comparison error has been detected in the 2-D 
DCT HDL simulation. 

For 2-D IDCT HDL simulation, 1000 sets of randomly 
generated 8x8 input data are used to assure the HDL coded 
2-D IDCT functionality, as well. The signal waveforms are 
illustrated in FIG. 23. The signal "sel_dct" is set low here 
to indicate the desired operation is a 2-D IDCT. 

In FIG. 23, several critical timing moments are appropri- 
ately labeled to indicate the following events: 

(a) System reset signal goes from high to low, which 
indicates that the system reset process has just com- 
pleted; 

(b) At the positive edge of the clock, the first 4 data from
the first input data set are fetched into the tested
module. A finite state machine will keep incrementing
the current state "inp_st[3:0]" and set the input
ready-to-receive signal, "inp_rtr", to high again when all 64
input data from the input data set have been fetched.
The data loaded in will go directly into the sub-block
modules. The next input data set will not be fetched into
the tested system until both the tested module is ready
to receive (signal "inp_rtr" is asserted) and the "top"
module has input data ready to send (which is
demonstrated by setting input ready-to-send signal "inp_rts"
to high);

(c) At the positive edge of the clock, the pseudo-multipliers
start to transmit the multiplication results
into the accumulators in 16 consecutive cycles. The
multiplier ready-to-send signal, "mul_rts", was
asserted, which indicates that the output data from the
multipliers are valid at the rising edges of the clock.
Another finite state machine is employed here to track
the current accumulator state "acc_st[3:0]", which
increments by one after the positive edge of the clock
to show that an accumulation operation has completed;

(d) At the positive edge of the clock, the last 4 data from
the first input data set are fetched into the tested
module. The input ready-to-receive signal "inp_rtr" is
asserted again to indicate that the tested module is
ready for the next input data set;

(e) At the positive edge of the clock, the first 4 data from
the second input data set are fetched into the tested
module. There is 0 cycle latency between the first and
second input data sets;

(f) At the positive edge of the clock, the current accumulator
results are being latched into accumulator latches.
The accumulator ready-to-send signal, "acc_rts", was
asserted before the positive edge, which indicates that
the accumulators latching should occur at the rising
clock edge. The third finite state machine is employed
here and the current state machine value "out_st[3:0]"
is used as the out-mux selector to control the operations
of the 16-to-1 output multiplexers;

(g) At the positive edge of the clock, the shuffler is going
to finish its first shuffling operation for the sub-blocks'
output based on the first input data set. The shuffler
ready-to-send signal, "shf_rts", will change from low
to high after this rising edge of the clock and will stay
asserted for a total of 16 consecutive cycles, which
indicates that the data are available for further processing;

(h) At the positive edge of the clock, the 2-D IDCT results
based on the first input data set are available and will
be shifted out in a total of 16 consecutive cycles as long
as the output ready-to-send signal, "out_rts", remains
high. The 4 IDCT output data in each cycle are compared
to the pre-calculated IDCT results and any inconsistency
will trigger the comparison error signal "cmp_err".
The comparison counter "cmp_cnt" is used to
track how many comparisons have been made;

(i) At the positive edge of the clock, the last 4 data from
the second input data set are fetched into the tested
module. The input ready-to-receive signal "inp_rtr" is
asserted again to indicate that the tested module is
ready for the next input data set;

(j) At the positive edge of the clock, the first 4 data from
the third input data set are fetched into the tested
module. There is a 1 cycle pause between the second and
third input data sets;

(k) At the positive edge of the clock, the current accumulator
results are being latched into accumulator
latches. The accumulator ready-to-send signal,
"acc_rts", was asserted before the positive edge, which
indicates that the accumulators latching should occur
on the rising clock edge for the second input data set;

(l) At the positive edge of the clock, the 2-D IDCT results
based on the second input data set are available and will
be shifted out in a total of 16 consecutive cycles as long
as the output ready-to-send signal "out_rts" remains
high.

Not a single comparison error has been detected in the 2-D
IDCT HDL simulation, either.

From the RTL simulation results shown in FIGS. 22 and
23, one can see that the throughput rate for both the DCT and
IDCT computations is 12,500,000 8x8 blocks per second,
which is equivalent to 800 MSample/s.

6.3 Logic Synthesis for Example 2-D DCT/IDCT Algorithm

Logic synthesis is a process which is primarily intended
to be used in the design of digital integrated circuits, in
particular, ASIC devices. In general, logic synthesis software
tools are used with specialized languages such as
Verilog and VHDL to efficiently describe and simulate the
desired operation of the circuit. If used properly, the synthesis
tools can then automatically generate the gate-level
logic schematics, based on appropriate HDL descriptions.
The fact that much of this work can be automated represents
an improvement in the overall design process in terms of
helping to manage the complexity and improving the
productivity and accuracy [KA95].

In the synthesis process, the validated HDL code is first
translated into a generic netlist and some Boolean expressions.
The generic netlist consists of some basic elements
which are described in a generic library. The concept of the
generic netlist allows for the original design to be transformed
or optimized depending on customized constraints.
When instructed to reduce the gate count, the
entire set of Boolean equations is minimized by grouping
common terms and simplifying. If the idea is to achieve
better control of the timing performance, the levels of logic
at gate level are reduced even if this transformation implies
duplicating certain expressions. Finally, the resulting
description has to be mapped onto a physical target library
(or technology library). The result of the whole synthesis
process has to fit the area and speed constraints specified by
the designers. A successful HDL synthesis requires an
in-depth understanding of the synthesis process and the
operation of synthesis tools. There are several "do's and
don'ts" that a designer must be aware of when coding in
HDL for synthesis [KA95, Syn96].

The HDL coded 2-D DCT/IDCT is synthesized with
Synopsys's Design Compiler® tool ([Syn96])
based on a single-edge triggered 200 MHz system clock. The
technology library used in the synthesis is TSMC's 3.3 V 0.35
μm CMOS library. A bottom-up plus timing-budget strategy is
adopted in the synthesis process since the goal of this
synthesis process is not to generate the smallest netlist, but
to verify the feasibility of the HDL coded 2-D DCT/IDCT
module. Since the 200 MHz system clock is adopted in the
synthesis, all the critical path delays within the module
should be less than 5 ns.

In the rest of this section, only the final synthesis result of
an example 2-D DCT/IDCT algorithm is presented, since the
step-by-step synthesis process and the customized compiler
constraints associated with it are simply too meticulous and
tedious to be presented. But it should be pointed out that
even with advanced synthesis tools, ASIC synthesis is far
from an automatic push-button process, and requires a fair
amount of iteration. To realize a high quality design, the
designer must simultaneously consider both the coding of
the design and the requirements for the logic synthesis.
Some HDL constructs synthesize more effectively than
others, and HDL code which might simulate correctly is
not necessarily synthesizable.

The success of the synthesis process is greatly
facilitated by the simplicity, regularity, modularity and local
connectivity of the algorithm according to the present invention
and its example HDL design. The synthesis results for
each major component in the 2-D DCT/IDCT implementation
are summarized as follows:

(1) Shuffler: consists of 4 adders and 4 subtractors with a
butterfly connection. Simple ripple-carry adders and
subtractors are selected by the Design Compiler® in
the synthesis in order to reduce the complexity and
silicon area of the design, provided the timing requirements
are met for each critical path. The circuit latency
associated with this module is 2 cycles. A 2 ns input
delay constraint is applied to the 2 adders and 2
subtractors in the top layer since the input data may also
come from the output-multiplexers in the 2-D IDCT
case. The critical path delay reported here is 2.13 ns.
The total combinational and noncombinational area is
reported as 10590 units (about 3530 gates);

(2) Sub-block Operators: consist of 40 pseudo-multipliers
(6 for EE sub-block, 12 for EO sub-block, 12 for OE
sub-block and 10 for OO sub-block), 16 6-to-1, 32 12-to-1
and 16 10-to-1 multiplexers and 64 16-bit accumulators
(adder/subtractor). Simple ripple-carry adders, subtractors
and adder/subtractors are selected by Design Compiler®
in the synthesis in order to reduce the complexity
and silicon area of the design, provided that the
timing requirements are met for each critical path. The
latency associated with each pseudo-multiplier is 3
cycles, and the latency associated with each combined
multiplexer/accumulator is 1 cycle. The critical path
delays for the pseudo-multipliers and combined
multiplexer/accumulators are reported as 4.55 ns and
2.15 ns/1.82 ns, respectively. The total combinational
and noncombinational area for the total of 40
pseudo-multipliers is 51600 units (about 17200 gates). The
total combinational and noncombinational area for the 64
multiplexer/accumulators is reported as 126300 units
(about 42100 gates);

(3) Accumulator Latches: consist of 64 16-bit latches. The
total combinational and noncombinational area is
reported as 35840 units (about 11947 gates);



(4) Output-Muxes: consist of 4 16-to-1 multiplexers. The
critical path delay for the 16-to-1 multiplexer is reported
as 1.91 ns. The total noncombinational area for 4
multiplexers is reported as 6096 units (about 2032
gates);

(5) Truncation & Saturation Control Units: consist of 4
DCT truncation & saturation control units and 4 IDCT
truncation & saturation control units. Simple ripple-carry
incrementors are used for all truncation operations.
A 1 cycle circuit delay is required in this stage. A
2 ns input delay constraint is applied to the DCT
truncation & saturation control modules since the input
data for them come directly from the output-multiplexers.
The critical path delay for truncation and saturation is
reported as (1.75+0.25) ns. The total combinational and
noncombinational area is reported as 5998 units (about
2000 gates);

(6) Other Auxiliary Components: include 4 input-multiplexers,
4 intermediate multiplexers, several finite
state machines and some handshaking control modules,
etc. The optimization constraints for these modules are
focused mainly on silicon area. The total combinational
and noncombinational area for all these components is
around 30000 units (about 10000 gates).

Since the average silicon area for each logic gate is around
60 μm² for TSMC's 0.35 μm technology library, the
95,000 gates of the combined 2-D DCT/IDCT module
consume about 5.7 mm² of silicon area. If the place-and-route
factor, which is determined by a chip manufacturer as 1.2 for
this technology library, is taken into account in the area
estimation, the estimated final chip size of the combined 2-D
DCT/IDCT example implementation would be 6.84 mm²
(1.2 x 5.7 mm²).

The critical path timing results achieved in the synthesis
indicate that the combined 2-D DCT/IDCT synthesis
process is over-constrained in one example. A smaller chip
size can be achieved by relaxing some synthesis constraints
or applying characterized constraints to each submodule and
recompiling it with Design Compiler (using the characterize
and recompile approach).

The main features of the example 2-D DCT/IDCT chip
implementation, along with some other 2-D DCT or IDCT
example chip implementations, are summarized in Table 6.1.
It is worth noting that there is no intention to make any direct
comparison among the different implementations, since
different technologies (i.e. with different minimum channel
widths) and clock rates are used by these implementations,
and there are big trade-offs between chip size,
throughput and latency.

TABLE 6.1

Feature summaries of some 2-D DCT/IDCT chips

                            Core Area   Clock
Implementation  Technology  (mm²)       Rate     Throughput  Latency  DCT/IDCT
[SL92]          2 μm        72.68       50 MHz   25 MHz      >64      Both
[SL96]          1.2 μm      240         50 MHz   50 MHz      >64      Both
[Miy93]         0.8 μm      160.89      50 MHz   50 MHz      >64      Both
[Ura92]         0.8 μm      21.12       100 MHz  100 MHz     >64      DCT only
[MW95]          0.8 μm      10.0        100 MHz  100 MHz     >64      Both
[RT92]          1 μm        110.25      40 MHz   160 MHz     24       IDCT only
[CW95]          0.8 μm      81          40 MHz   320 MHz     17       DCT only
Proposed        0.35 μm     6.8         200 MHz  800 MHz     7        Both

However, it is still worthwhile to try to scale the chip
areas consumed by the different implementations to a
relatively neutral base depending on the technologies used, the
clock speeds employed or the system throughputs achieved,
etc. A relative chip area is used that can be calculated as the
ratio of the core area over the smallest unit area employed
by chip manufacturers (which is based on the channel width;
for 0.8 μm technology, the unit area is 0.64). A
relative-area/throughput is also defined as the ratio between the
relative chip area and the system throughput of the
implementation (the tabulated values correspond to throughput
expressed in units of 100 MHz). Using the parameters listed in
Table 6.1, the new ratios can be generated as shown in Table 6.2.

TABLE 6.2

2-D DCT/IDCT chip comparison sorted by relative-area/throughput

                                        Core Area  Relative  Relative-area/
Implementation  Technology  Throughput  (mm²)      Area      Throughput      DCT/IDCT
[Miy93]         0.8 μm      50 MHz      160.89     251.39    502.78          Both
[SL96]          1.2 μm      50 MHz      240        166.67    333.34          Both
[SL92]          2 μm        25 MHz      72.68      18.17     72.68           Both
[RT92]          1 μm        160 MHz     110.25     110.25    68.91           IDCT only
[CW95]          0.8 μm      320 MHz     81         126.56    39.55           DCT only
[Ura92]         0.8 μm      100 MHz     21.12      33.00     33.00           DCT only
[MW95]          0.8 μm      100 MHz     10.0       15.63     15.63           Both
Proposed        0.35 μm     800 MHz     6.8        55.51     6.94            Both
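
The normalization used for Table 6.2 can be reproduced from the Table 6.1 figures. The Python sketch below is illustrative only. It assumes that the unit area is the square of the channel width in μm (consistent with the 0.64 quoted for 0.8 μm technology) and that the relative-area/throughput column is expressed per 100 MHz of throughput, which is inferred from the tabulated values; the names `TABLE_6_1` and `normalize` are hypothetical.

```python
# Recompute the Table 6.2 columns from the Table 6.1 data.
# Assumptions: unit area = (channel width in um) squared, and the
# relative-area/throughput ratio is taken per 100 MHz of throughput.

# (implementation, channel width in um, core area in mm^2, throughput in MHz)
TABLE_6_1 = [
    ("[SL92]", 2.0, 72.68, 25),
    ("[SL96]", 1.2, 240.0, 50),
    ("[Miy93]", 0.8, 160.89, 50),
    ("[Ura92]", 0.8, 21.12, 100),
    ("[MW95]", 0.8, 10.0, 100),
    ("[RT92]", 1.0, 110.25, 160),
    ("[CW95]", 0.8, 81.0, 320),
    ("Proposed", 0.35, 6.8, 800),
]


def normalize(rows):
    """Return (name, relative area, relative-area/throughput), largest first."""
    result = []
    for name, width_um, area_mm2, throughput_mhz in rows:
        relative_area = area_mm2 / (width_um * width_um)
        ratio = relative_area / (throughput_mhz / 100.0)
        result.append((name, relative_area, ratio))
    return sorted(result, key=lambda row: row[2], reverse=True)


TABLE_6_2 = normalize(TABLE_6_1)
```

Sorting by the last column in descending order yields the same row order as Table 6.2, with the proposed implementation last.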



10 



Moreover, a gate-level HDL simulation is also performed.
The gate-level HDL code, which is captured in the
combined 2-D DCT/IDCT logic synthesis process, has been
used to replace the RTL description used in section 6.2. The
requirements for 2-D DCT/IDCT gate-level validation are
exactly the same as for the RTL, and the 2-D DCT/IDCT output
signals are also compared with the pre-calculated results. It
has been concluded from the gate-level simulation that the
original desired functionality of the proposed algorithm has
been achieved by the pre-layout gate-level HDL code.

6.4 Summary 

In this section, a modern synthesis-oriented ASIC design 
approach has been applied to the implementation of an 
example 2-D DCT/IDCT algorithm. 

The HDL design of the proposed algorithm starts by
coding its functionalities in the Verilog hardware description
language. Detailed Verilog code for some key components
of the proposed algorithm has been included in this section.
By using an RTL description to precisely model the designed
circuit, the captured 2-D DCT/IDCT functionality can be
simulated with Cadence's Verilog-XL® simulation tool.

The logic synthesis of the combined 2-D DCT/IDCT
module has been carried out based on the validated RTL
code. The technology library used in the synthesis process is
TSMC's 3.3 V 0.35 μm CMOS library, and a bottom-up plus
timing-budget strategy is adopted in the synthesis process.
By using the Synopsys Design Compiler® tool, the HDL coded
2-D DCT/IDCT module is synthesized based on a single-edge
triggered 200 MHz system clock. The structural HDL
code of the combined 2-D DCT/IDCT module is generated from
the logic synthesis process, and the gate-level simulation
with the generated structural code has shown that the
original desired functionality of the algorithm has been
preserved.

Moreover, an estimated 6.84 mm² chip area and a throughput
rate of 800 million samples per second for both the 2-D
DCT and IDCT show a competitive edge over the
other 2-D DCT/IDCT chip implementations summarized in
Table 6.1. Each of the 2-D DCT and IDCT example
implementations of the present invention described above is
illustrative and not necessarily intended to limit the present
invention.
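
The 6.84 mm² estimate quoted in this summary follows from the gate count, the average area per gate and the place-and-route factor given in section 6.3. A small Python sketch of that arithmetic is shown here for reference; the function name `estimated_area_mm2` is hypothetical and the inputs are the figures stated in the text.

```python
# Worked arithmetic for the chip-area estimate of section 6.3.

GATES = 95_000                 # gate count of the combined module
AREA_PER_GATE_UM2 = 60         # average area per gate, 0.35 um library
PLACE_AND_ROUTE_FACTOR = 1.2   # factor quoted for this technology library


def estimated_area_mm2(gates=GATES,
                       area_per_gate_um2=AREA_PER_GATE_UM2,
                       factor=PLACE_AND_ROUTE_FACTOR):
    """Return (core area, estimated chip area), both in mm^2."""
    core = gates * area_per_gate_um2 / 1e6   # 1 mm^2 = 1e6 um^2
    return core, core * factor
```

With the default values the core area comes out at about 5.7 mm² and the estimated chip area at about 6.84 mm².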

7.0 Conclusions 

In this section, some of the major contributions of this
invention are summarized. MPEG-2 video encoding and
decoding applications are also presented.

7.1 Contributions of this Invention 

It has been shown in Section 1 that all the MPEG video
codec implementations so far are limited to MPEG-1 or
MP@ML of the MPEG-2 specification. The major obstacles to
real-time encoding/decoding in compliance with the MPEG-2
MP@H-14, MP@HL, HP@H-14 and HP@HL specifications
lie in the huge computational intensity and very
complicated structure required. Targeted at these
problems, the hybrid MPEG-2 video codec implementation
scheme according to one embodiment of the present
invention, which is based on the fact that the MPEG-2
encoding/decoding process can be classified into operations
that are parallel in nature and operations that are serial in
nature, takes full advantage of both the powerful processing
ability of hardware and the maximum flexibility of software.

One of the most important normative components of an
MPEG-2 codec is the 2-D DCT/IDCT operation. All the 2-D DCT
or IDCT processors developed so far have made use of the
separability property of the 2-D DCT or IDCT: based on the
Row-Column Method (RCM) approach, the 2-D DCT/IDCT
operation is decomposed into two separate 1-D DCT/IDCTs,
and each 1-D DCT/IDCT is then realized by a
relatively simple, regular structure with relatively low design
and layout cost. But the chip implementations based on
RCM have a major drawback: their throughput rate is
relatively low because an intermediate memory component and
a serial I/O mode are adopted in almost all these RCM
approaches. The maximum throughput of 2-D DCT/IDCT
processors based on RCM so far is about 100 million
samples per second, which fails to meet the 2-D DCT/IDCT
throughput requirements of the MPEG-2 MP@H-14, MP@HL,
HP@H-14 and HP@HL specifications, which demand a throughput
rate of more than 100 million samples per second for the 2-D
8x8 DCT or IDCT. In contrast, the 2-D DCT/IDCT
algorithm and its hardware implementation according to the
present invention can achieve a much higher throughput rate
by adopting a parallel I/O mode and a pipelined architecture
without using any intermediate memory component. Based
on a direct 2-D coefficient matrix factorization/
decomposition approach, the algorithm of the present
invention is not only more computationally efficient, requiring
a smaller number of multiplications, but can also
achieve higher computation precision with a shorter finite
internal wordlength compared with RCM algorithms.
In addition, one example hardware implementation of this
algorithm requires only simple localized communication
interconnection among its processing elements instead of the
complex global communication interconnection that prevents
other direct 2-D DCT/IDCT algorithms from being
implemented in VLSI chips. The HDL simulation and logic
synthesis results show that one example hardware
implementation of this algorithm is one of the first successful
attempts to map a direct 2-D DCT/IDCT algorithm onto
silicon.
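
As an illustration of the direct decomposition summarized above, the following Python sketch checks numerically that an 8x8 2-D DCT can be assembled from four independent 4x4 sub-block products fed by a shuffler stage. It is a minimal sketch under stated assumptions, not the described hardware: the orthonormal DCT-II scaling, the choice of E and O as the left halves of the even and odd rows of the 8x8 DCT matrix, and all function names are assumptions made for this example.

```python
import math


def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C (assumed scaling), so that Z = C X C^T."""
    rows = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(r) for r in zip(*m)]


def dct2_direct(x):
    """Reference 8x8 2-D DCT computed as C X C^T."""
    c = dct_matrix()
    return matmul(matmul(c, x), transpose(c))


def shuffle(x):
    """Form the four 4x4 matrices X++, X-+, X+-, X-- from an 8x8 block."""
    pp = [[0.0] * 4 for _ in range(4)]
    mp = [[0.0] * 4 for _ in range(4)]
    pm = [[0.0] * 4 for _ in range(4)]
    mm = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            # a, b, c, d are the four mirror-image samples of the block.
            a, b = x[i][j], x[i][7 - j]
            c, d = x[7 - i][j], x[7 - i][7 - j]
            pp[i][j] = a + b + c + d
            mp[i][j] = a - b + c - d
            pm[i][j] = a + b - c - d
            mm[i][j] = a - b - c + d
    return pp, mp, pm, mm


def dct2_subblocks(x):
    """8x8 2-D DCT assembled from four independent 4x4 sub-block products."""
    c = dct_matrix()
    e = [c[u][:4] for u in range(0, 8, 2)]  # even rows, left half (assumed E)
    o = [c[u][:4] for u in range(1, 8, 2)]  # odd rows, left half (assumed O)
    pp, mp, pm, mm = shuffle(x)
    z1 = matmul(matmul(e, pp), transpose(e))  # EE sub-block
    z2 = matmul(matmul(e, mp), transpose(o))  # EO sub-block
    z3 = matmul(matmul(o, pm), transpose(e))  # OE sub-block
    z4 = matmul(matmul(o, mm), transpose(o))  # OO sub-block
    z = [[0.0] * 8 for _ in range(8)]
    for u in range(4):
        for v in range(4):
            z[2 * u][2 * v] = z1[u][v]
            z[2 * u][2 * v + 1] = z2[u][v]
            z[2 * u + 1][2 * v] = z3[u][v]
            z[2 * u + 1][2 * v + 1] = z4[u][v]
    return z
```

Calling dct2_subblocks and dct2_direct on the same 8x8 block should give the same coefficients up to floating-point rounding. With plain matrix products as written here, the four sub-blocks use 4 x 128 = 512 multiplications against 1024 for the two 8x8 products of the reference; the sketch does not attempt the further reductions described in this specification.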

In summary, the algorithm and its hardware implementa- 
tion according to the present invention have some advan- 
tages over other 2-D DCT/IDCT processors: 

(1) The development of this algorithm and its hardware
implementation, in one embodiment, is targeted to the
application of the MPEG-2 MP@H-14, MP@HL, HP@H-14
and HP@HL specifications. By employing the concurrent
pipelined architecture, the synthesized chip
implementation can achieve a high throughput of 800
million samples per second (i.e. 12,500,000 8x8 block
DCTs or IDCTs per second) when driven by a single-edge
triggered 200 MHz clock, which guarantees that it can
meet any current or near-future HDTV requirements;

(2) The relatively small latency (only a few cycles of circuit
delay) makes it a perfect candidate for real-time video
applications;

(3) The direct 2-D coefficient matrix factorization/
decomposition approach allows the 2-D DCT/IDCT
operation to be carried out with only one coefficient
approximation loss and one truncation loss. Compared
with row/column transforms of two 1-D DCT/IDCT
calculations, the accumulated errors are reduced
dramatically, achieving much better precision
(about 40 dB mean SNR) with the same or shorter finite
internal wordlength;

(4) No real multiplier is required, since every multiplication
in this scheme is confined to a variable
multiplied by a constant instead of two variables in
general. Whereas an ordinary n-bit by n-bit
multiplier consists of n(n-2) full adders, n half adders
and n² AND gates, each n-bit adder requires only n full
adders. As a result, the 2-D DCT/IDCT processor in
one embodiment is made only of very simple
components such as adders, subtracters, adder/subtracters
and multiplexers, which makes it very easy to
synthesize and implement with VLSI technology;

(5) All the processing elements (PEs) in one example 2-D 
DCT/IDCT chip implementation are 100% sharable 
between 2-D DCT and 2-D IDCT operations. 
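
The throughput and component-count figures quoted in items (1) and (4) can be checked with a few lines of arithmetic. The Python sketch below is illustrative only; the function names and the 16-bit example width are assumptions introduced here, while the cell-count formulas are the ones stated in item (4).

```python
def blocks_per_second(samples_per_second, block_side=8):
    """Number of block_side x block_side blocks processed per second."""
    return samples_per_second // (block_side * block_side)


def samples_per_clock(samples_per_second, clock_hz):
    """Average number of samples produced per clock cycle."""
    return samples_per_second / clock_hz


def array_multiplier_cells(n):
    """Cell counts for an ordinary n-bit by n-bit multiplier, per item (4)."""
    return {"full_adders": n * (n - 2), "half_adders": n, "and_gates": n * n}


def adder_cells(n):
    """Cell count for an n-bit adder, per item (4)."""
    return {"full_adders": n}


if __name__ == "__main__":
    rate = 800_000_000   # samples per second, from item (1)
    clock = 200_000_000  # 200 MHz single-edge triggered clock
    print(blocks_per_second(rate))         # 8x8 blocks per second
    print(samples_per_clock(rate, clock))  # samples per clock cycle
    print(array_multiplier_cells(16))      # hypothetical 16-bit multiplier
    print(adder_cells(16))                 # hypothetical 16-bit adder
```

Dividing the sample rate by the 64 samples of an 8x8 block gives the block rate, and dividing it by the clock frequency shows that four samples are produced per clock cycle, which is consistent with four sub-blocks operating in parallel.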

The key execution units of one example 2-D DCT/IDCT
chip implementation according to the present invention
consist of a total of 90 adders and subtractors, as well as 64
adder/subtractors used as accumulators. Based on TSMC's 3.3 V
0.35 µm CMOS library, the estimated chip area of this
example 2-D DCT/IDCT implementation is about 6.84
mm².

7.2 Other Applications 

The contributions of this invention can be combined with
future work on hybrid MPEG-2 video codec implementation.
The present invention can be combined with any
conventional and future implementations of quantization
and inverse quantization in an MPEG encoding/decoding
process. Quantization and inverse quantization can be
integrated into the 2-D DCT/IDCT chip since both of them are
parallel operations in nature.

Similarly, the present invention can be combined with
conventional and future-developed motion estimation in an
MPEG encoding process. Motion estimation can be combined
with the present invention and carried out by a
hardware component.

Further simplification can also be applied to the 2-D DCT/
IDCT algorithm of the present invention. Since the even
rows of coefficient matrix E are still even-symmetric and the
odd rows are odd-symmetric, further decomposition of the
algorithm can be carried out to further reduce the total
number of multiplications required to compute the 2-D
DCT/IDCT.
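
The row symmetry of matrix E mentioned above can be illustrated with a short Python sketch. It assumes, for this example only, that E is the left half of the even rows of an orthonormal 8x8 DCT-II matrix; the function names are hypothetical. The sketch checks the symmetry of each row and shows how a product of E with a 4-element vector can then be folded so that it needs 8 multiplications instead of 16.

```python
import math


def even_matrix():
    """4x4 matrix E: left half of the even rows of the 8x8 DCT (assumed form)."""
    rows = []
    for u in range(4):
        scale = math.sqrt(1.0 / 8) if u == 0 else 0.5
        rows.append([scale * math.cos((2 * i + 1) * (2 * u) * math.pi / 16)
                     for i in range(4)])
    return rows


def row_symmetry(row, tol=1e-12):
    """Classify a row as 'even', 'odd' or 'none' under left-right mirroring."""
    half = len(row) // 2
    if all(abs(row[i] - row[-1 - i]) < tol for i in range(half)):
        return "even"
    if all(abs(row[i] + row[-1 - i]) < tol for i in range(half)):
        return "odd"
    return "none"


def apply_e_plain(e, v):
    """Reference product E v using 16 multiplications."""
    return [sum(row[i] * v[i] for i in range(4)) for row in e]


def apply_e_folded(e, v):
    """Product E v using the row symmetry: 8 multiplications."""
    sums = [v[0] + v[3], v[1] + v[2]]   # used by even-symmetric rows
    diffs = [v[0] - v[3], v[1] - v[2]]  # used by odd-symmetric rows
    out = []
    for u, row in enumerate(e):
        half = sums if u % 2 == 0 else diffs
        out.append(row[0] * half[0] + row[1] * half[1])
    return out
```

The folding step is the same sum-and-difference idea used by the shuffler, applied one level further down; how far such a decomposition is taken in hardware is a design choice that this sketch does not address.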

For some MPEG video codec applications, a throughput rate of
400 million samples per second for the 2-D DCT/IDCT is
more than enough. The preliminary result shows that after
a minor architecture change for the proposed HDL
functionality, the estimated chip area can be reduced to
about ½ of its original size when the RTL design is
re-synthesized with a single-edge triggered 100 MHz system
clock instead of the 200 MHz one. Even when driven at 100 MHz,
the processor still has a higher throughput than most other 2-D
DCT/IDCT chip implementations.

What is claimed is:

1. A system for computing a two-dimensional discrete 
cosine transform (2D-DCT) of input data, the input data 
including at least first through sixty-fourth input elements of 
an 8 rowx8 column matrix X, comprising: 

a shuffler having four parallel inputs and four parallel 
outputs, each input receiving a respective group of 
sixteen rows of the first through 64th elements, wherein 
said shuffler processes each row of four parallel input 
elements (a,b,c,d) received on said four inputs in parallel
and outputs four output elements (x++, x-+, x+-,
x--) in parallel for each processed input row, each
output element representing a different linear combi- 
nation of four corresponding input elements; and 
first through fourth sub-block operators (EE, EO, OE, 
OO) each coupled in parallel to a respective one of said
four outputs of said shuffler; wherein said first through 
fourth sub-block operators each process a set of sixteen 
output elements from said shuffler independently, and 
generate respective first through fourth sets of sixteen 
matrix products, each set of matrix products represent- 
ing a product of three independent 4x4 matrix multi- 
plications of a respective set of said sixteen output 
elements output from said shuffler. 

2. The system of claim 1, wherein said shuffler calculates 
first through fourth output elements for each row of first 
through fourth input elements such that: 

said first output element (x++) equals a sum of said first
through fourth input elements (a+b+c+d);
said second output element (x-+) equals a sum of said
first input element minus said second input element
plus said third input element and minus said fourth
input element (a-b+c-d);
said third output element (x+-) equals a sum of said first
input element plus said second input element minus
said third input element and minus said fourth input
element (a+b-c-d); and
said fourth output element (x--) equals a sum of said first
input element minus said second input element minus
said third input element and plus said fourth input element
(a-b-c+d).

3. The system of claim 2, wherein said shuffler comprises 
first through fourth adders and first through fourth subtrac- 
tors interconnected in two layers, each adder and subtracter 
has two inputs and an output; wherein: 

said first adder inputs receive said first and second input 
elements and said first adder output is coupled to said 
third adder and said third subtracter; 
said first subtracter inputs receive said first and second 
input elements and said first subtracter output is 
coupled to said fourth adder and said fourth subtracter; 
said second adder inputs receive said third and fourth 
input elements and said second adder output is coupled 
to said third adder and said third subtracter; and 
said second subtracter inputs receive said third and fourth 
input elements and said second subtracter output is 
coupled to said fourth adder and said fourth subtracter. 

4. The system of claim 1, wherein said shuffler includes
at least four adders and four subtracters. 
5. The system of claim 1, wherein 
said first sub-block operator (EE subblock) outputs said
first set of sixteen matrix products equal to a 4x4 matrix
Z1;

said second sub-block operator (EO subblock) outputs 

said second set of sixteen matrix products equal to a 

4x4 matrix Z2; 
said third sub-block operator (OE subblock) outputs said 

third set of sixteen matrix products equal to a 4x4 

matrix Z3; and 
said fourth sub-block operator (OO subblock) outputs said

fourth set of sixteen matrix products equal to a 4x4 

matrix Z4; where 

$Z_1 = E X_{++} E^T$,

$Z_2 = E X_{-+} O^T$,

$Z_3 = O X_{+-} E^T$,

$Z_4 = O X_{--} O^T$; and



4x4 matrix E has only even coefficients of said coefficient
vector W and 4x4 matrix O has only odd coefficients of
said coefficient vector W as follows:



= P2AP4, 

where said coefficient vector W consists of coefficients
$w_k$, and $w_k$ is proportional to $\cos(k\pi/16)$ for k=1, 2, . . . 7.

6. The system of claim 5, wherein: 

said first sub-block operator (EE subblock) comprises six 
multipliers; sixteen accumulators; and a 6 to 16 mul- 
tiplexer controlled by a first mux-selector signal, said 6 
to 16 multiplexer being coupled between each of said 
six multipliers and each of said sixteen accumulators. 

7. The system of claim 6, wherein: 

said second sub-block operator (EO subblock) comprises
twelve multipliers; sixteen accumulators; and a 12 to 16
multiplexer controlled by a second mux-selector signal,
said 12 to 16 multiplexer being coupled between each
of said twelve multipliers and each of said sixteen
accumulators;

said third sub-block operator (OE subblock) comprises
twelve multipliers; sixteen accumulators; and a 12 to 16
multiplexer controlled by a third mux-selector signal,
said 12 to 16 multiplexer being coupled between each
of said twelve multipliers and each of said sixteen
accumulators; and
said fourth sub-block operator (OO subblock) comprises
ten multipliers; sixteen accumulators; and a 10 to 16
multiplexer controlled by a fourth mux-selector signal,
said 10 to 16 multiplexer being coupled between each
of said ten multipliers and each of said sixteen
accumulators.

8. The system of claim 7, wherein each multiplier comprises
a pseudo-multiplier.

9. The system of claim 1, further comprising: 

first through fourth output stages coupled to receive 
outputs from said first through fourth sub-block 
operators, respectively; 
each of said first through fourth output stages comprising:
a plurality of latches; 
a 16 to 1 multiplexer; and 

a clipper; said 16 to 1 multiplexer being coupled 
between each latch and said clipper. 
10. The system of claim 9, wherein said clipper comprises
a truncation and saturation control unit.

11. The system of claim 1, wherein said first through 
fourth sub-block operators each have a 13-bit coefficient 
quantization and a 15-bit finite internal wordlength. 
12. The system of claim 1, wherein the input data comprises
video data compressed according to at least one of a
MPEG and JPEG standard. 

13. A system for computing a two-dimensional inverse 
discrete cosine transform (2D-IDCT) of input data, the input
data including at least first through sixty-fourth input elements
of an 8 rowx8 column matrix $X_{ij}$, where $0 \le i,j \le 3$,
comprising:

a multiplexer that divides the input data into first to fourth 
4x4 sub-matrices Xee, Xoe, Xeo, and Xoo based on 
whether each element has an even or odd row and 
column coefficient such that 16 elements having an 
even row and an even column are included in sub- 
matrix Xee, 16 elements having an odd row and even 
column are included in sub-matrix Xoe, 16 elements 
having an even row and odd column are included in
sub-matrix Xeo, and 16 elements having an odd row 
and odd column are included in sub-matrix Xoo; and 
first through fourth sub-block operators (EE, EO, OE, 
OO) receiving said first to fourth 4x4 sub-matrices Xee,
Xoe, Xeo, and Xoo, respectively, said sub-block operators
processing said first to fourth 4x4 sub-matrices
Xee, Xoe, Xeo, and Xoo independently and generating 
respective first through fourth sets of sixteen matrix 
products, each set of matrix products representing a 
product of three independent 4x4 matrix multiplications
of a respective set of said sixteen input elements
in said input data. 

14. The system of claim 13, further comprising: 

a shuffler having four parallel inputs and four parallel 
outputs, each input coupled in parallel to receive an
output from a respective one of said first through fourth 
sub-block operators. 

15. The system of claim 14, wherein said shuffler includes 
at least four adders and four subtracters. 

16. The system of claim 14, wherein said shuffler outputs
64 elements of an output matrix Z representing a 2D IDCT 
transform of the input data of matrix X. 

17. The system of claim 16, wherein output matrix Z
includes four 4x4 submatrices Z5 to Z8, said shuffler
outputting first to fourth sets of 16 elements in the submatrices
Z5 to Z8 on said four outputs of said shuffler in parallel,
where said submatrices Z5 to Z8 are defined by:



$Z_5 = E^T X_{ee} E + E^T X_{oe} O + O^T X_{eo} E + O^T X_{oo} O$,
and
4x4 matrix E has only even coefficients of a coefficient
vector W and 4x4 matrix O has only odd coefficients of 
said coefficient vector W as follows: 

where said coefficient vector W consists of coefficients
$w_k$, and $w_k$ is proportional to $\cos(k\pi/16)$ for k=1, 2, . . . 7.

18. The system of claim 13, wherein: 

said first sub-block operator (EE subblock) comprises six
multipliers; sixteen accumulators; and a 6 to 16
multiplexer controlled by a first mux-selector signal, said 6
to 16 multiplexer being coupled between each of said
six multipliers and each of said sixteen accumulators;

said second sub-block operator (EO subblock) comprises
twelve multipliers; sixteen accumulators; and a 12 to 16
multiplexer controlled by a second mux-selector signal,
said 12 to 16 multiplexer being coupled between each
of said twelve multipliers and each of said sixteen
accumulators;

said third sub-block operator (OE subblock) comprises
twelve multipliers; sixteen accumulators; and a 12 to 16
multiplexer controlled by a third mux-selector signal,
said 12 to 16 multiplexer being coupled between each
of said twelve multipliers and each of said sixteen
accumulators; and

said fourth sub-block operator (OO subblock) comprises
ten multipliers; sixteen accumulators; and a 10 to 16
multiplexer controlled by a fourth mux-selector signal,
said 10 to 16 multiplexer being coupled between each
of said ten multipliers and each of said sixteen
accumulators.

19. The system of claim 18, wherein each multiplier 
comprises a pseudo-multiplier.

20. The system of claim 13, further comprising: 

first through fourth output stages coupled between said 
first through fourth sub-block operators, respectively, 
and inputs of said shuffler, each of said first through 
fourth output stages comprising: a plurality of latches; 
and a 16 to 1 multiplexer; and

first through fourth clippers coupled to said first through 
fourth outputs of said shuffler. 

21. The system of claim 20, wherein each clipper com- 
prises a truncation and saturation control unit. 

22. The system of claim 13, wherein said first through 
fourth sub-block operators each have a 13-bit coefficient 
quantization and a 16-bit finite internal wordlength. 

23. The system of claim 13, wherein the input data 
comprises video pictures compressed according to at least 
one of a MPEG and JPEG standard. 

24. A hybrid 2D-DCT and IDCT system that receives 
DCT input data and IDCT input data comprising: 

a first input multiplexer having a first input that receives 

the DCT input data and four outputs; 
a shuffler having four inputs coupled to said four outputs 

of said first input multiplexer and an output; 
a second input multiplexer having a first input coupled to 

said output of said shuffler and four outputs; and 
four sub-block operators, each having an input coupled in 

parallel to a respective output of said second input 

multiplexer and each sub-block operator having sixteen 

outputs. 

25. The hybrid system of claim 24, further comprising: 
four DCT clippers; 

four IDCT clippers; 

latches coupled to each sub-block operator output; and 
an output multiplexer having inputs coupled to each of
said latches and outputs coupled to said first input 
multiplexer and to said four DCT clippers; and 

wherein outputs of said shuffler are also coupled to said 
four IDCT clippers.

26. The system of claim 24, wherein the hybrid 2D-DCT
and IDCT system is implemented in hardware on a single 
VLSI chip. 

27. A method for switching between transforming 
2D-DCT data and 2D-IDCT data, comprising the steps of: 

switching first and second multiplexers to pass the 
2D-DCT input data through a shuffler then through four 
sub-block operators to obtain a matrix Z output data 
representing a 2D-DCT of the input 2D-DCT input 
data; and 

switching the first and the second multiplexers to pass the 
2D-IDCT input data through the four sub-block opera- 
tors and then the shuffler to obtain a matrix Z output 
data representing a 2D-IDCT of the input 2D-IDCT 
input data, 

wherein the input data comprises either video data or 
decoded video data, and wherein said first switching 
step is performed prior to encoding the video data and 
said second switching step is performed on the decoded 
video data after inverse scanning and quantization. 